Re: Fuzzy Counter?
Ian Kelly wrote: On Tue, Sep 23, 2014 at 11:01 PM, Miki Tebeka miki.teb...@gmail.com wrote: On Tuesday, September 23, 2014 7:33:06 PM UTC+3, Rob Gaddi wrote: While you're at it, think long and hard about that definition of fuzziness. If you can make it closer to the concept of histogram bins you'll get much better performance. The problem for me here is that I can't determine the number of bins in advance. I'd like to get frequencies. I guess every new (don't have any previous equal item) can be a bin. Then your result depends on the order of your input, which is usually not a good thing. Why would you need to determine the *number* of bins in advance? You just need to determine where they start and stop. If for example your epsilon is 0.5, you could determine the bins to be at [-0.5, 0.5); [0.5, 1.5); [1.5, 2.5); ad infinitum. Then for each actual value you encounter, you could calculate the appropriate bin, creating it first if it doesn't already exist. That has the unfortunate implication that: 0.50001 and 1.4 (delta = 0.8) are considered equal, but: 1.50001 and 1.4 (delta = 0.2) are considered unequal. -- Steven -- https://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy Counter?
Greetings, On Wednesday, September 24, 2014 5:57:15 PM UTC+3, Ian wrote: Then your result depends on the order of your input, which is usually not a good thing. As stated in previous reply - I'm OK with that. Why would you need to determine the *number* of bins in advance? You just need to determine where they start and stop. If for example your epsilon is 0.5, you could determine the bins to be at [-0.5, 0.5); [0.5, 1.5); [1.5, 2.5); ad infinitum. Then for each actual value you encounter, you could calculate the appropriate bin, creating it first if it doesn't already exist. I see what you mean. I thought you mean histogram like bins where you usually state the number of bins in advance. -- https://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy Counter?
On 09/23/2014 09:32 AM, Rob Gaddi wrote: On Tue, 23 Sep 2014 05:34:19 -0700 (PDT) Miki Tebeka wrote: Before I start writing my own. Is there something like collections.Counter (fore frequencies) that does fuzzy matching? Meaning x is considered equal to y if abs(x - y) epsilon. (x, y and my case will be numpy.array). You'll probably have to write that yourself. While you're at it, think long and hard about that definition of fuzziness. If you can make it closer to the concept of histogram bins you'll get much better performance. You might want to take a look at the reference implementation for PEP 455 [1]. If you can decide on a method to transform your keys (such as taking the floor, or the half, or something like that), then that should work as is. -- ~Ethan~ [1] http://legacy.python.org/dev/peps/pep-0455/ -- https://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy Counter?
On Wed, Sep 24, 2014, at 00:57, Miki Tebeka wrote: On Tuesday, September 23, 2014 4:37:10 PM UTC+3, Peter Otten wrote: x eq y y eq z not (x eq z) where eq is the test given above -- should x, y, and z land in the same bin? Yeah, I know the counting depends on the order of items. But I'm OK with that. It doesn't just depend on the order. If you put x and z in first (creating two bins), then which one does y go in after? -- https://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy Counter?
On Tue, 23 Sep 2014 22:01:51 -0700 (PDT) Miki Tebeka miki.teb...@gmail.com wrote: On Tuesday, September 23, 2014 7:33:06 PM UTC+3, Rob Gaddi wrote: While you're at it, think long and hard about that definition of fuzziness. If you can make it closer to the concept of histogram bins you'll get much better performance. The problem for me here is that I can't determine the number of bins in advance. I'd like to get frequencies. I guess every new (don't have any previous equal item) can be a bin. TL;DR you need to think very hard about your problem definition and what you want to happen before you actually try to implement this. Always a good advice :) I'm actually implementing algorithm for someone else (in the bio world where I know very little about). See, THERE's your problem. You've got a scientist trying to make prescriptions for an engineering problem. He's given you a fuzzy description of the sort of thing he's trying to do. Your job is to turn that fuzzy description into a concrete, actual algorithm before you even write a single line of code, which means understanding what the data is, and what the desired result of that data is. Because the thing you keep trying to do, with all of its order dependencies fundamentally CANNOT be right, regardless of what the squishy scientist tells you. The histogram bin solution that everyone keeps trying to steer you towards is almost certainly what you really want. Epsilon is your resolution. You cannot resolve any information below your resolution limit. Yes, 1.49 and 1.51 wind up in different bins, whereas 1.51 and 2.49 are in the same one, but that's what it means to have a resolution of 1; you can't say anything about whether any given count in the 2, plus or minus a bit bin is very nearly 1 or very nearly 3. This doesn't require you to know the number of bins in advance, you can just create and fill them as needed. That said, you're trying to solve a physical problem, and so it has physical limits. Your biologist should be able to give you an order of magnitude estimate of how many bins you're expecting, and what the ultimate shape is expected to look like. Normally distributed? Wildly bimodal? Is the overall span of data going to span 10 epsilon or 10,000 epsilon? If there are going to be a ton of bins, you may be better served by putting 1/3 of a count into bins n-1, n, and n+1 rather than just in bin n; it's the equivalent of squinting a bit when you look at the bins. But you have to understand the problem to solve it. -- Rob Gaddi, Highland Technology -- www.highlandtechnology.com Email address domain is currently out of order. See above to fix. -- https://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy Counter?
On Fri, Sep 26, 2014, at 14:30, Rob Gaddi wrote: The histogram bin solution that everyone keeps trying to steer you towards is almost certainly what you really want. Epsilon is your resolution. You cannot resolve any information below your resolution limit. Yes, 1.49 and 1.51 wind up in different bins, whereas 1.51 and 2.49 are in the same one, but that's what it means to have a resolution of 1; you can't say anything about whether any given count in the 2, plus or minus a bit bin is very nearly 1 or very nearly 3. You could antialias the values, though. 1.49 results in a value that is 51% in the 1 bin, and 49% in the 2 bin. count[1] += 0.51, count[2] += 0.49. You could even spread each value across a larger number of smaller bins. -- https://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy Counter?
On Fri, 26 Sep 2014 15:10:43 -0400 random...@fastmail.us wrote: On Fri, Sep 26, 2014, at 14:30, Rob Gaddi wrote: The histogram bin solution that everyone keeps trying to steer you towards is almost certainly what you really want. Epsilon is your resolution. You cannot resolve any information below your resolution limit. Yes, 1.49 and 1.51 wind up in different bins, whereas 1.51 and 2.49 are in the same one, but that's what it means to have a resolution of 1; you can't say anything about whether any given count in the 2, plus or minus a bit bin is very nearly 1 or very nearly 3. You could antialias the values, though. 1.49 results in a value that is 51% in the 1 bin, and 49% in the 2 bin. count[1] += 0.51, count[2] += 0.49. You could even spread each value across a larger number of smaller bins. Right, but there's still that stateless determination of which bin (or bins) 1.49 goes in. The history of the bins is irrelevant, which is the important part. -- Rob Gaddi, Highland Technology -- www.highlandtechnology.com Email address domain is currently out of order. See above to fix. -- https://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy Counter?
On Tue, Sep 23, 2014 at 11:01 PM, Miki Tebeka miki.teb...@gmail.com wrote: On Tuesday, September 23, 2014 7:33:06 PM UTC+3, Rob Gaddi wrote: While you're at it, think long and hard about that definition of fuzziness. If you can make it closer to the concept of histogram bins you'll get much better performance. The problem for me here is that I can't determine the number of bins in advance. I'd like to get frequencies. I guess every new (don't have any previous equal item) can be a bin. Then your result depends on the order of your input, which is usually not a good thing. Why would you need to determine the *number* of bins in advance? You just need to determine where they start and stop. If for example your epsilon is 0.5, you could determine the bins to be at [-0.5, 0.5); [0.5, 1.5); [1.5, 2.5); ad infinitum. Then for each actual value you encounter, you could calculate the appropriate bin, creating it first if it doesn't already exist. -- https://mail.python.org/mailman/listinfo/python-list
Fuzzy Counter?
Greetings, Before I start writing my own. Is there something like collections.Counter (fore frequencies) that does fuzzy matching? Meaning x is considered equal to y if abs(x - y) epsilon. (x, y and my case will be numpy.array). Thanks, -- Miki -- https://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy Counter?
Miki Tebeka wrote: Before I start writing my own. Is there something like collections.Counter (fore frequencies) that does fuzzy matching? Meaning x is considered equal to y if abs(x - y) epsilon. (x, y and my case will be numpy.array). The problem I see with that description is that for x, y, z, assumming x eq y y eq z not (x eq z) where eq is the test given above -- should x, y, and z land in the same bin? I'm not really qualified here, but k-means clustering may be a direction to investigate: http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans.html -- https://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy Counter?
On 2014-09-23 05:34, Miki Tebeka wrote: Before I start writing my own. Is there something like collections.Counter (fore frequencies) that does fuzzy matching? Meaning x is considered equal to y if abs(x - y) epsilon. (x, y and my case will be numpy.array). Not that I know of -- the performance characteristics would differ radically from the O(1) expectations that Counter() gives you since you'd have to compare your value against *every* value in the container[*], and you have the possibility that two items in the container are equidistant yet within EPSILON: EPSILON = 0.51 my_counter[1.0] += 1 my_counter[2.0] += 1 my_counter[1.5] += 1 # which one should be incremented? -tkc [*] This could be mitigated if you had some constraints on your inputs and wanted to do some sort of bucket-hashing, limiting it to a subset of the possible keys -- https://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy Counter?
On Tue, 23 Sep 2014 05:34:19 -0700 (PDT) Miki Tebeka miki.teb...@gmail.com wrote: Greetings, Before I start writing my own. Is there something like collections.Counter (fore frequencies) that does fuzzy matching? Meaning x is considered equal to y if abs(x - y) epsilon. (x, y and my case will be numpy.array). Thanks, -- Miki You'll probably have to write that yourself. While you're at it, think long and hard about that definition of fuzziness. If you can make it closer to the concept of histogram bins you'll get much better performance. If, for instance, you can live with 1.anything being one bin, and 2.anything being the next (even though this puts 1.999 and 2.000 into separate bins) then you can just floor() the number and use the result as the key. If you really need the continuous fuzziness you'll probably have to keep the potential keys sorted and run a binary search against the list of them to find the nearest. This will still run you into some problems based on ordering. For instance, with an epsilon of 0.1, you'd put 2.0 into a bin, then 1.9 into the same bin, then 2.1 into the same bin. If however you put 1.9 into that bin first, then 2.0 would go into that bin, but 2.1 would go into a different one. TL;DR you need to think very hard about your problem definition and what you want to happen before you actually try to implement this. -- Rob Gaddi, Highland Technology -- www.highlandtechnology.com Email address domain is currently out of order. See above to fix. -- https://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy Counter?
On Wed, Sep 24, 2014 at 2:32 AM, Rob Gaddi rgaddi@technologyhighland.invalid wrote: You'll probably have to write that yourself. While you're at it, think long and hard about that definition of fuzziness. If you can make it closer to the concept of histogram bins you'll get much better performance. If, for instance, you can live with 1.anything being one bin, and 2.anything being the next (even though this puts 1.999 and 2.000 into separate bins) then you can just floor() the number and use the result as the key. Or even check three bins based on that - the floor(), floor()+1, and floor()-1 bins. That way, you're not running into problems at any boundaries, and you still limit the exponential growth of the comparisons. But it's still fundamentally problematic. ChrisA -- https://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy Counter?
On Tuesday, September 23, 2014 4:37:10 PM UTC+3, Peter Otten wrote: x eq y y eq z not (x eq z) where eq is the test given above -- should x, y, and z land in the same bin? Yeah, I know the counting depends on the order of items. But I'm OK with that. -- https://mail.python.org/mailman/listinfo/python-list
Re: Fuzzy Counter?
On Tuesday, September 23, 2014 7:33:06 PM UTC+3, Rob Gaddi wrote: While you're at it, think long and hard about that definition of fuzziness. If you can make it closer to the concept of histogram bins you'll get much better performance. The problem for me here is that I can't determine the number of bins in advance. I'd like to get frequencies. I guess every new (don't have any previous equal item) can be a bin. TL;DR you need to think very hard about your problem definition and what you want to happen before you actually try to implement this. Always a good advice :) I'm actually implementing algorithm for someone else (in the bio world where I know very little about). -- https://mail.python.org/mailman/listinfo/python-list