On Tue, 23 Sep 2014 22:01:51 -0700 (PDT)
Miki Tebeka <miki.teb...@gmail.com> wrote:

> On Tuesday, September 23, 2014 7:33:06 PM UTC+3, Rob Gaddi wrote:
> 
> > While you're at it, think
> > long and hard about that definition of fuzziness.  If you can make it
> > closer to the concept of histogram "bins" you'll get much better
> > performance.  
> The problem for me here is that I can't determine the number of bins in 
> advance. I'd like to get frequencies. I guess every "new" item (one with no 
> previous equal item) can be a bin.
> 
> > TL;DR you need to think very hard about your problem definition and
> > what you want to happen before you actually try to implement this.
> Always good advice :) I'm actually implementing an algorithm for someone else 
> (in the bio world, which I know very little about).

See, THERE's your problem.  You've got a scientist trying to make
prescriptions for an engineering problem.  He's given you a fuzzy
description of the sort of thing he's trying to do.  Your job is to
turn that fuzzy description into a concrete, actual algorithm
before you even write a single line of code, which means understanding
what the data is, and what the desired result of that data is.  Because
the thing you keep trying to do, with all of its order dependencies,
fundamentally CANNOT be right, regardless of what the squishy scientist
tells you.
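
To make the order dependence concrete, here is a minimal sketch.  It
assumes the scheme under discussion is roughly "each value joins the
first earlier value within epsilon, otherwise it starts a new bin";
the function name greedy_fuzzy_count is made up for illustration and
is not anyone's actual code.

```python
def greedy_fuzzy_count(values, epsilon):
    """Hypothetical order-dependent scheme (an assumption, not the
    original poster's code): each value joins the first existing
    representative within epsilon, else becomes a new representative."""
    counts = {}  # representative value -> count, in insertion order
    for v in values:
        for rep in counts:
            if abs(v - rep) <= epsilon:
                counts[rep] += 1
                break
        else:
            # No existing representative was close enough.
            counts[v] = 1
    return counts


# Same three values, two different arrival orders.
a = greedy_fuzzy_count([1.0, 1.5, 2.0], epsilon=0.5)
b = greedy_fuzzy_count([1.5, 1.0, 2.0], epsilon=0.5)
print(a)  # {1.0: 2, 2.0: 1}
print(b)  # {1.5: 3}
```

Identical data produces two bins in one order and a single bin in the
other, so the "frequencies" depend on how the input happened to be
sorted.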

The "histogram" bin solution that everyone keeps trying to steer you
towards is almost certainly what you really want.  Epsilon is your
resolution.  You cannot resolve any information below your resolution
limit.  Yes, 1.49 and 1.51 wind up in different bins, whereas 1.51 and
2.49 are in the same one, but that's what it means to have a resolution
of 1; you can't say anything about whether any given count in the "2,
plus or minus a bit" bin is very nearly 1.5 or very nearly 2.5.
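
A minimal sketch of that binning, assuming bins are centered on
integer multiples of epsilon (the helper names bin_index and histogram
are illustrative only).  It uses floor(x + 0.5) rather than round()
because Python's round() sends exact halves to the nearest even
integer, which would make the bin edges inconsistent.

```python
import math
from collections import Counter


def bin_index(value, epsilon):
    """Index of the bin centered on index * epsilon that holds value."""
    return math.floor(value / epsilon + 0.5)


def histogram(values, epsilon):
    """Count values per bin; the result does not depend on input order."""
    return Counter(bin_index(v, epsilon) for v in values)


hist = histogram([1.49, 1.51, 2.49], epsilon=1)
print(hist[1], hist[2])  # 1 2
```

With epsilon = 1, 1.49 lands in bin 1 while 1.51 and 2.49 share bin 2,
and shuffling the input gives the same counts.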

This doesn't require you to know the number of bins in advance; you can
just create and fill them as needed.  That said, you're trying to solve
a physical problem, and so it has physical limits.  Your biologist
should be able to give you an order of magnitude estimate of how many
"bins" you're expecting, and what the ultimate shape is expected to
look like.  Normally distributed?  Wildly bimodal?  Is the overall range
of the data going to cover 10 epsilon or 10,000 epsilon?  If there are going
to be a ton of bins, you may be better served by putting 1/3 of a count
into bins n-1, n, and n+1 rather than just in bin n; it's the
equivalent of squinting a bit when you look at the bins.
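
As a rough sketch of both ideas (bins created on demand, plus the 1/3
smoothing), assuming the same floor-based bin index as a plain
histogram would use; smoothed_histogram is a made-up name, and the
equal 1/3 weights are just the ones suggested above, not a tuned
kernel.

```python
import math
from collections import defaultdict


def smoothed_histogram(values, epsilon):
    """Spread each value's count equally over bins n-1, n, and n+1.

    Bins are created lazily, so the number of bins need not be known
    in advance.
    """
    bins = defaultdict(float)
    for v in values:
        n = math.floor(v / epsilon + 0.5)
        for k in (n - 1, n, n + 1):
            bins[k] += 1 / 3
    return dict(bins)


smooth = smoothed_histogram([1.49, 1.51], epsilon=1)
for k in sorted(smooth):
    print(k, round(smooth[k], 3))
```

Here 1.49 and 1.51 still have different home bins (1 and 2), but after
smoothing, bins 1 and 2 each hold 2/3 of a count, so the two nearly
equal values no longer look unrelated.  The total count is unchanged.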

But you have to understand the problem to solve it.

-- 
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order.  See above to fix.