On 12/28/19 1:14 AM, Christopher Barker wrote:
On Fri, Dec 27, 2019 at 8:14 PM Richard Damon
<rich...@damon-family.org <mailto:rich...@damon-family.org>> wrote:
> It is a well known axiom of computing that returning an *incorrect*
> result is a very bad thing.
There is also an axiom that you can only expect valid results if you
meet the operations pre-conditions.
sure.
Sometimes, being totally defensive in checking for 'bad' inputs costs
you too much performance.
it can, yes, there are no hard rules about anything.
The stated requirement on the statistics module is you feed it
'numbers', and a NaN is by definition Not a Number.
Sure, but NaN IS a part of the Python float type, and it can and will
show up once in a while. That is not the same as expecting median (or
any other function in the module) to work with some arbitrary other
type. Practicality beats purity -- floats WILL be widely used in the
module, they should be accommodated.
Here is the text in the docs:
"""
This module provides functions for calculating mathematical statistics
of numeric (Real-valued) data.
<snip>
Unless explicitly noted, these functions support int, float, Decimal
and Fraction. Behaviour with other types (whether in the numeric tower
or not) is currently unsupported.
"""
So this is pretty clear -- the module is designed to work with int,
float, Decimal and Fraction -- so I think it's worth some effort to
support those well. And both float and Decimal have a NaN of some sort.
But the documentation that you reference says it works with NUMBERS, and
NaNs are explicitly NOT A NUMBER, so the statistics module specifically
hasn't made a claim that it will work with them.
Also, the general section says "unless explicitly noted", and median
makes an explicit reference to types that support order but not addition
needing to use a different function, which implies that a data type that
IS ordered and supports addition is usable with median (presumably some
user-defined number class that acts close enough to the other number
classes that the average of a and b is a value between them).
The median function also implies that its inputs have the property of
having an order, as well as being able to be added (and if you can't add
them, then you need to use median_low or median_high)
sure, but those are expected of the types above anyway. It doesn't
seem to me that this package is designed to work with ANY type that
has order and supports the operations needed by the function. For
instance, lists of numbers can be compared, so:
In [69]: stuff = [[1,2,3],
...: [4,6,1],
...: [8,9],
...: [4,5,6,7,8,9],
...: [5.0]
...: ]
In [70]:
In [70]: statistics.median(stuff)
Out[70]: [4, 6, 1]
Actually, since lists don't support addition in the manner required,
median isn't appropriate, but perhaps it has meaning with
median_low(). For example, if the lists represented hierarchical
references in a document (Chapter 1, Section 2, Paragraph 3) then the
median might have a meaning.
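For what it's worth, median_low() can already handle data like that today, since it only sorts and picks an element rather than averaging; a quick sketch:

```python
import statistics

# Hierarchical references (chapter, section, paragraph) compare
# lexicographically, so median_low() can pick the middle one
# without ever needing to add two lists together.
sections = [[1, 2, 3], [4, 6, 1], [8, 9], [4, 5, 6, 7, 8, 9], [5.0]]
print(statistics.median_low(sections))  # the middle element after sorting
```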
The fact that that worked is just a coincidence, it is not an
important part of the design that it does.
I will also point out that really the median function is a small
part of
the problem, all the x-tile based functions have the same issue,
absolutely, and pretty much all of them, though many will just give
you NaN as a result, which is much better than an arbitrary result.
By x-tile based functions I mean things like median (the 50th
percentile) and the upper and lower quartiles (the 75th and 25th
percentiles); the one provided in the module is quantiles(), which can
actually compute any even grouping.
and
fundamentally it is a problem with sorted().
I don't think so. In fact, sorted() is explicitly designed to work
with any type that is consistently ordered (I'm not sure if they HAVE
to be "totally ordered"), and floats with NaNs are not that.
Most sort routines require that the data at least define a partial
order, and they effectively define a total order by considering that if
a < b is false and b < a is false then a and b form an equivalence class
whose internal order we don't care about. Sets, for instance, with <
being "subset of", have a consistent ordering but don't form consistent
equivalence classes and thus don't sort properly.
As the statistics module is specifically designed to work with numbers
(and essentially only with numbers) it's the appropriate place to put
an accommodation for NaNs. Not to mention that while sorted() could be
adapted to do something more consistent with NaNs, since it is a
general purpose function, it's hard to know what the behavior should
be -- raise an Exception? remove them? put them all at the front or
back? Which makes sense depends entirely on the application. The
statistics module, on the other hand, is for statistics, so while
there are still multiple options, it's a lot easier to pick one and go
for it.
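As a sketch of one of those options, a key function (a hypothetical helper, not something the module provides) can push NaNs to the back before the statistics are computed:

```python
import math

def nan_last_key(x):
    # False sorts before True, so all real numbers come first
    # and the NaNs are grouped together at the back.
    return (isinstance(x, float) and math.isnan(x), x)

data = [3.0, float('nan'), 1.0, 4.0, 2.0]
print(sorted(data, key=nan_last_key))
```

The NaNs never get compared against the real numbers, because the boolean first element of the key tuple decides those comparisons.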
Has anyone tried to implement a version of these that checks for
inputs
that don't have a total order and seen what the performance impact is?
I think checking for total order is a red herring -- there really is a
reason to specifically deal with NaNs in floats (and Decimals), not
with ANY type that may not be totally ordered.
Part of the problem is that fixing one issue doesn't come close to
fixing the whole problem. The stated problem is that newbie/casual
programmers get confused that some things don't work when NaNs get into
the mix. Why is it ok to say that [3, 1, 4, nan, 2] is a 'sorted' array,
but not ok for 4 to be the median of that array?
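To make the confusion concrete, a small demonstration (the exact garbled order depends on the input, since every comparison against a NaN is False):

```python
import statistics

nan = float('nan')
data = [3.0, nan, 1.0, 4.0, 2.0]
print(sorted(data))             # even the non-NaN values end up out of order
print(statistics.median(data))  # whatever value lands in the middle slot
```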
I am not saying that we can't fix the problem (though I think I have
made a reasonable argument that it isn't a problem that MUST be fixed),
but rather that a change to just median is the wrong spot to make this
fix. The real problem is that naive programmers, when they do something
wrong and get NaNs into their data, get confused by some of the strange
answers that they can get. The fact that NaNs don't order is one of the
points of confusion, and rather than trying to fix, one by one, the
various operations that are defined for sorted data to handle the issue,
why not go to the core issue and deal with the base sorting operations?
Perhaps even better would be a math mode (perhaps unfortunately needing
to be the default, since we are trying to help beginners) that isn't
fully IEEE compliant but throws exceptions on the errors that get us
into the territory that causes the confusion. This might actually not
impact that many programs in real life, as how many programs actually
need to generate NaNs as a result of calculations?
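For what it's worth, the decimal module already offers something close to that mode: its context traps make invalid operations raise instead of quietly producing a NaN. A sketch:

```python
from decimal import Decimal, localcontext, InvalidOperation

with localcontext() as ctx:
    ctx.traps[InvalidOperation] = True   # this is the default for Decimal
    try:
        Decimal(0) / Decimal(0)          # would yield NaN if the trap were off
    except InvalidOperation:
        print("0/0 raised instead of silently producing NaN")
```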
Testing for NaNs isn't trivial, as it was pointed out elsewhere
that how you check depends on the type of number you have (Decimals
being different from floats).
yes, that is unfortunate.
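A type-aware check might look like this sketch (is_nan is a hypothetical name; the fallback relies on a quiet NaN being the only value unequal to itself):

```python
import math
from decimal import Decimal

def is_nan(x):
    # Dispatch on type: math.isnan() is for floats,
    # while Decimal has its own is_nan() method.
    if isinstance(x, float):
        return math.isnan(x)
    if isinstance(x, Decimal):
        return x.is_nan()
    # Generic fallback: a quiet NaN is the only value not equal to itself.
    return x != x
```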
To be really complete, you want to actually
detect that you have some elements that don't form a total order with
the other elements.
as above, being "complete" isn't necessary here. And even if you were
complete, what would you DO with a general non-totally-ordered type?
Raising an Exception would be OK, but any other option (like treating it
as a missing value) wouldn't make sense if you don't know more about
the type.
The only answer that I see that makes sense is raising an Exception
(likely in sorted). You also probably can't be totally thorough in
checking, as completely verifying transitivity would be an O(N**3)
operation, which is way too slow. Likely you would live with just
testing that each pair you check has exactly one of a < b, a == b,
b < a being true.
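That pairwise test is itself O(N**2), so still not cheap; a sketch, with check_trichotomy as a hypothetical name:

```python
from itertools import combinations

def check_trichotomy(data):
    # Verify that exactly one of a < b, a == b, b < a holds for every
    # pair. NaNs fail (all three are False), as do sets ({1} vs {2}).
    for a, b in combinations(data, 2):
        if (a < b) + (a == b) + (b < a) != 1:
            raise ValueError(f"no consistent ordering between {a!r} and {b!r}")
```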
In many ways, half fixing the issue makes it worse, as improving
reliability can lead you to forget that you need to take care, so the
remaining issues catch you harder.
I can't think of an example of this -- it's a fine principle, but if
we were to make the entire statistics module do something reasonable
with NaNs -- what other issues, exactly, would that hide?
In short: practicality beats purity:
- The fact that the module is designed to work with all the standard
number types doesn't mean it has to work with ANY type that supports
the operations needed for a given function.
- NaNs are part of the Python float and Decimal implementations -- so
they WILL show up once in a while. It would be good to handle them.
- NaNs can also be very helpful to indicate missing values -- this
can actually be very handy for statistics calculations. So it could be
a nice feature to add -- that NaN means "missing value"
ASSUMING that NaNs represent missing data is just ONE possible
interpretation for it. Making it THE interpretation in a low level
package seems wrong. It also is an interpretation that is easy to create
with a simple helper function that takes one sequence and returns
another one where all the NaNs are removed. Other interpretations of
what a NaN should reflect aren't as easy to implement outside the
operation, so if you are going to pick an interpretation, it probably
should be something else that is hard to handle externally.
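That helper is nearly a one-liner; a sketch (drop_nans is a hypothetical name):

```python
import math
from decimal import Decimal

def drop_nans(data):
    # Treat NaN as "missing": return the data with all NaNs filtered out.
    def _is_nan(x):
        return ((isinstance(x, float) and math.isnan(x))
                or (isinstance(x, Decimal) and x.is_nan()))
    return [x for x in data if not _is_nan(x)]
```

Then statistics.median(drop_nans(data)) gives the "NaN means missing" behaviour without baking that interpretation into the module itself.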
Also, Python, because it is dynamically typed, would in my mind be
better served by using something like None to indicate missing data in
the array. NaN was chosen in some languages as a missing value because
they couldn't handle mixed-type data arrays.
-CHB
--
Christopher Barker, PhD
--
Richard Damon
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/VRDAF4HV4GSTTHSK7NM5KOLRB3QPOO72/
Code of Conduct: http://python.org/psf/codeofconduct/