Re: Fuzzy Counter?

2014-09-26 Thread Steven D'Aprano
Ian Kelly wrote:

 On Tue, Sep 23, 2014 at 11:01 PM, Miki Tebeka miki.teb...@gmail.com
 wrote:
 On Tuesday, September 23, 2014 7:33:06 PM UTC+3, Rob Gaddi wrote:

 While you're at it, think
 long and hard about that definition of fuzziness.  If you can make it
 closer to the concept of histogram bins you'll get much better
 performance.
 The problem for me here is that I can't determine the number of bins in
 advance. I'd like to get frequencies. I guess every new (don't have any
 previous equal item) can be a bin.
 
 Then your result depends on the order of your input, which is usually
 not a good thing.
 
 Why would you need to determine the *number* of bins in advance? You
 just need to determine where they start and stop. If for example your
 epsilon is 0.5, you could determine the bins to be at [-0.5, 0.5);
 [0.5, 1.5); [1.5, 2.5); ad infinitum. Then for each actual value you
 encounter, you could calculate the appropriate bin, creating it first
 if it doesn't already exist.

That has the unfortunate implication that:

0.50001 and 1.4 (delta = 0.8)

are considered equal, but:

1.50001 and 1.4 (delta = 0.2)

are considered unequal.



-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fuzzy Counter?

2014-09-26 Thread Miki Tebeka
Greetings,

On Wednesday, September 24, 2014 5:57:15 PM UTC+3, Ian wrote:
 Then your result depends on the order of your input, which is usually
 not a good thing.
As stated in previous reply - I'm OK with that.

 Why would you need to determine the *number* of bins in advance? You
 just need to determine where they start and stop. If for example your
 epsilon is 0.5, you could determine the bins to be at [-0.5, 0.5);
 [0.5, 1.5); [1.5, 2.5); ad infinitum. Then for each actual value you
 encounter, you could calculate the appropriate bin, creating it first
 if it doesn't already exist.
I see what you mean. I thought you mean histogram like bins where you usually 
state the number of bins in advance.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fuzzy Counter?

2014-09-26 Thread Ethan Furman

On 09/23/2014 09:32 AM, Rob Gaddi wrote:

On Tue, 23 Sep 2014 05:34:19 -0700 (PDT) Miki Tebeka wrote:


Before I start writing my own. Is there something like collections.Counter (fore 
frequencies) that does fuzzy matching?

Meaning x is considered equal to y if abs(x - y)  epsilon. (x, y and my case 
will be numpy.array).


You'll probably have to write that yourself.  While you're at it, think
long and hard about that definition of fuzziness.  If you can make it
closer to the concept of histogram bins you'll get much better
performance.


You might want to take a look at the reference implementation for PEP 455 [1].  If you can decide on a method to 
transform your keys (such as taking the floor, or the half, or something like that), then that should work as is.


--
~Ethan~


[1] http://legacy.python.org/dev/peps/pep-0455/
--
https://mail.python.org/mailman/listinfo/python-list


Re: Fuzzy Counter?

2014-09-26 Thread random832
On Wed, Sep 24, 2014, at 00:57, Miki Tebeka wrote:
 On Tuesday, September 23, 2014 4:37:10 PM UTC+3, Peter Otten wrote:
  x eq y 
  y eq z
  not (x eq z)
  
  where eq is the test given above -- should x, y, and z land in the same bin?
 Yeah, I know the counting depends on the order of items. But I'm OK with
 that.

It doesn't just depend on the order. If you put x and z in first
(creating two bins), then which one does y go in after?
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fuzzy Counter?

2014-09-26 Thread Rob Gaddi
On Tue, 23 Sep 2014 22:01:51 -0700 (PDT)
Miki Tebeka miki.teb...@gmail.com wrote:

 On Tuesday, September 23, 2014 7:33:06 PM UTC+3, Rob Gaddi wrote:
 
  While you're at it, think
  long and hard about that definition of fuzziness.  If you can make it
  closer to the concept of histogram bins you'll get much better
  performance.  
 The problem for me here is that I can't determine the number of bins in 
 advance. I'd like to get frequencies. I guess every new (don't have any 
 previous equal item) can be a bin.
 
  TL;DR you need to think very hard about your problem definition and
  what you want to happen before you actually try to implement this.
 Always a good advice :) I'm actually implementing algorithm for someone else 
 (in the bio world where I know very little about).

See, THERE's your problem.  You've got a scientist trying to make
prescriptions for an engineering problem.  He's given you a fuzzy
description of the sort of thing he's trying to do.  Your job is to
turn that fuzzy description into a concrete, actual algorithm
before you even write a single line of code, which means understanding
what the data is, and what the desired result of that data is.  Because
the thing you keep trying to do, with all of its order dependencies
fundamentally CANNOT be right, regardless of what the squishy scientist
tells you.

The histogram bin solution that everyone keeps trying to steer you
towards is almost certainly what you really want.  Epsilon is your
resolution.  You cannot resolve any information below your resolution
limit.  Yes, 1.49 and 1.51 wind up in different bins, whereas 1.51 and
2.49 are in the same one, but that's what it means to have a resolution
of 1; you can't say anything about whether any given count in the 2,
plus or minus a bit bin is very nearly 1 or very nearly 3.

This doesn't require you to know the number of bins in advance, you can
just create and fill them as needed.  That said, you're trying to solve
a physical problem, and so it has physical limits.  Your biologist
should be able to give you an order of magnitude estimate of how many
bins you're expecting, and what the ultimate shape is expected to
look like.  Normally distributed?  Wildly bimodal?  Is the overall span
of data going to span 10 epsilon or 10,000 epsilon?  If there are going
to be a ton of bins, you may be better served by putting 1/3 of a count
into bins n-1, n, and n+1 rather than just in bin n; it's the
equivalent of squinting a bit when you look at the bins.

But you have to understand the problem to solve it.

-- 
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order.  See above to fix.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fuzzy Counter?

2014-09-26 Thread random832
On Fri, Sep 26, 2014, at 14:30, Rob Gaddi wrote:
 The histogram bin solution that everyone keeps trying to steer you
 towards is almost certainly what you really want.  Epsilon is your
 resolution.  You cannot resolve any information below your resolution
 limit.  Yes, 1.49 and 1.51 wind up in different bins, whereas 1.51 and
 2.49 are in the same one, but that's what it means to have a resolution
 of 1; you can't say anything about whether any given count in the 2,
 plus or minus a bit bin is very nearly 1 or very nearly 3.

You could antialias the values, though. 1.49 results in a value that
is 51% in the 1 bin, and 49% in the 2 bin. count[1] += 0.51,
count[2] += 0.49. You could even spread each value across a larger
number of smaller bins.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fuzzy Counter?

2014-09-26 Thread Rob Gaddi
On Fri, 26 Sep 2014 15:10:43 -0400
random...@fastmail.us wrote:

 On Fri, Sep 26, 2014, at 14:30, Rob Gaddi wrote:
  The histogram bin solution that everyone keeps trying to steer you
  towards is almost certainly what you really want.  Epsilon is your
  resolution.  You cannot resolve any information below your resolution
  limit.  Yes, 1.49 and 1.51 wind up in different bins, whereas 1.51 and
  2.49 are in the same one, but that's what it means to have a resolution
  of 1; you can't say anything about whether any given count in the 2,
  plus or minus a bit bin is very nearly 1 or very nearly 3.
 
 You could antialias the values, though. 1.49 results in a value that
 is 51% in the 1 bin, and 49% in the 2 bin. count[1] += 0.51,
 count[2] += 0.49. You could even spread each value across a larger
 number of smaller bins.

Right, but there's still that stateless determination of which bin (or
bins) 1.49 goes in.  The history of the bins is irrelevant, which is
the important part.

-- 
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order.  See above to fix.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fuzzy Counter?

2014-09-24 Thread Ian Kelly
On Tue, Sep 23, 2014 at 11:01 PM, Miki Tebeka miki.teb...@gmail.com wrote:
 On Tuesday, September 23, 2014 7:33:06 PM UTC+3, Rob Gaddi wrote:

 While you're at it, think
 long and hard about that definition of fuzziness.  If you can make it
 closer to the concept of histogram bins you'll get much better
 performance.
 The problem for me here is that I can't determine the number of bins in 
 advance. I'd like to get frequencies. I guess every new (don't have any 
 previous equal item) can be a bin.

Then your result depends on the order of your input, which is usually
not a good thing.

Why would you need to determine the *number* of bins in advance? You
just need to determine where they start and stop. If for example your
epsilon is 0.5, you could determine the bins to be at [-0.5, 0.5);
[0.5, 1.5); [1.5, 2.5); ad infinitum. Then for each actual value you
encounter, you could calculate the appropriate bin, creating it first
if it doesn't already exist.
-- 
https://mail.python.org/mailman/listinfo/python-list


Fuzzy Counter?

2014-09-23 Thread Miki Tebeka
Greetings,

Before I start writing my own. Is there something like collections.Counter 
(fore frequencies) that does fuzzy matching?

Meaning x is considered equal to y if abs(x - y)  epsilon. (x, y and my case 
will be numpy.array).

Thanks,
--
Miki
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fuzzy Counter?

2014-09-23 Thread Peter Otten
Miki Tebeka wrote:

 Before I start writing my own. Is there something like collections.Counter
 (fore frequencies) that does fuzzy matching?
 
 Meaning x is considered equal to y if abs(x - y)  epsilon. (x, y and my
 case will be numpy.array).

The problem I see with that description is that for x, y, z, assumming 

x eq y 
y eq z
not (x eq z)

where eq is the test given above -- should x, y, and z land in the same bin?

I'm not really qualified here, but k-means clustering may be a direction to 
investigate:

http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans.html

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fuzzy Counter?

2014-09-23 Thread Tim Chase
On 2014-09-23 05:34, Miki Tebeka wrote:
 Before I start writing my own. Is there something like
 collections.Counter (fore frequencies) that does fuzzy matching?
 
 Meaning x is considered equal to y if abs(x - y)  epsilon. (x, y
 and my case will be numpy.array).

Not that I know of -- the performance characteristics would differ
radically from the O(1) expectations that Counter() gives you since
you'd have to compare your value against *every* value in the
container[*], and you have the possibility that two items in the
container are equidistant yet within EPSILON:

  EPSILON = 0.51
  my_counter[1.0] += 1
  my_counter[2.0] += 1
  my_counter[1.5] += 1  # which one should be incremented?

-tkc


[*]
This could be mitigated if you had some constraints on your inputs
and wanted to do some sort of bucket-hashing, limiting it to a subset
of the possible keys





-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fuzzy Counter?

2014-09-23 Thread Rob Gaddi
On Tue, 23 Sep 2014 05:34:19 -0700 (PDT)
Miki Tebeka miki.teb...@gmail.com wrote:

 Greetings,
 
 Before I start writing my own. Is there something like collections.Counter 
 (fore frequencies) that does fuzzy matching?
 
 Meaning x is considered equal to y if abs(x - y)  epsilon. (x, y and my case 
 will be numpy.array).
 
 Thanks,
 --
 Miki

You'll probably have to write that yourself.  While you're at it, think
long and hard about that definition of fuzziness.  If you can make it
closer to the concept of histogram bins you'll get much better
performance.  If, for instance, you can live with 1.anything being one
bin, and 2.anything being the next (even though this puts 1.999 and
2.000 into separate bins) then you can just floor() the number and use
the result as the key.

If you really need the continuous fuzziness you'll probably have to
keep the potential keys sorted and run a binary search against the list
of them to find the nearest.  This will still run you into some
problems based on ordering.  For instance, with an epsilon of 0.1,
you'd put 2.0 into a bin, then 1.9 into the same bin, then 2.1 into the
same bin.  If however you put 1.9 into that bin first, then 2.0 would
go into that bin, but 2.1 would go into a different one.

TL;DR you need to think very hard about your problem definition and
what you want to happen before you actually try to implement this.

-- 
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order.  See above to fix.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fuzzy Counter?

2014-09-23 Thread Chris Angelico
On Wed, Sep 24, 2014 at 2:32 AM, Rob Gaddi
rgaddi@technologyhighland.invalid wrote:
 You'll probably have to write that yourself.  While you're at it, think
 long and hard about that definition of fuzziness.  If you can make it
 closer to the concept of histogram bins you'll get much better
 performance.  If, for instance, you can live with 1.anything being one
 bin, and 2.anything being the next (even though this puts 1.999 and
 2.000 into separate bins) then you can just floor() the number and use
 the result as the key.

Or even check three bins based on that - the floor(), floor()+1, and
floor()-1 bins. That way, you're not running into problems at any
boundaries, and you still limit the exponential growth of the
comparisons.

But it's still fundamentally problematic.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fuzzy Counter?

2014-09-23 Thread Miki Tebeka
On Tuesday, September 23, 2014 4:37:10 PM UTC+3, Peter Otten wrote:
 x eq y 
 y eq z
 not (x eq z)
 
 where eq is the test given above -- should x, y, and z land in the same bin?
Yeah, I know the counting depends on the order of items. But I'm OK with that.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Fuzzy Counter?

2014-09-23 Thread Miki Tebeka
On Tuesday, September 23, 2014 7:33:06 PM UTC+3, Rob Gaddi wrote:

 While you're at it, think
 long and hard about that definition of fuzziness.  If you can make it
 closer to the concept of histogram bins you'll get much better
 performance.  
The problem for me here is that I can't determine the number of bins in 
advance. I'd like to get frequencies. I guess every new (don't have any 
previous equal item) can be a bin.

 TL;DR you need to think very hard about your problem definition and
 what you want to happen before you actually try to implement this.
Always a good advice :) I'm actually implementing algorithm for someone else 
(in the bio world where I know very little about).
-- 
https://mail.python.org/mailman/listinfo/python-list