On 2023-02-17 18:08:09 -0500, avi.e.gr...@gmail.com wrote: > Analogies I am sharing are mainly for me to wrap my head around an idea by > seeing if it matches any existing ideas or templates and is not meant to be > exact. Fair enough?
Yeah. But if you are venting your musings into a public space you shouldn't be surprised if people react to them. And we can only react to what you write, not what you think. > But in this case, from my reading, the analogy is rather reasonable. Although that confused me a bit. You are clearly responding to something I thought about but which you didn't quote below. Did I just think about it and not write it, but you responded anyway because you're a mind reader? Nope, it just turns out you (accidentally) deleted that sentence. > The implementation of Roaring Bitmaps seems to logically break up the > space of all possible values it looks up into multiple "zones" that > are somewhat analogous to individual files, That part is correct. But you presented that in the form of a performance/space tradeoff, writing about "trying multiple methods" to find the smallest, and that that makes compression slower. That may be the case for pkzip, but it's not what RB does: Instead it uses a very simple heuristic: If there are less than 25% of the bits set in a zone, it uses a list of offsets, otherwise a plain bitmap. (according to their 2016 paper which I just skimmed through again - maybe the implementation is a bit more complex now). So I think your description would lead the reader to anticipate problems which aren't there and probably miss ones which are there. So I'll stay with my "not completely wrong but not very helpful" assessment. > I did not raise the issue and thus have no interest in promoting this > technology nor knocking it down. Just wondering what it was under the hood > and whether I might even have a need for it. I am not saying Social Security > numbers are a fit, simply that some form of ID number might fit. Yeah, that's the point: Any form of ID which is a small-ish integer number fits. And maybe it's just because I work with databases a lot, but representing things with numeric ids (in relational databases we call them "surrogate keys") is such a basic operation that it doesn't warrant more than a sentence or two. > If a company has a unique ID number for each client and uses it > consistently, then an implementation that holds a set stored this way > of people using product A, such as house insurance, and those using > product B, such as car insurance, and perhaps product C is an umbrella > policy, might easily handle some queries such as who uses two or all > three (intersections of sets) or who might merit a letter telling them > how much they can save if they subscribed to two or all three as a way > to get more business. Again, just a made-up example I can think > about. A company which has a million customers over the years will > have fairly large sets as described. A much better example. This is indeed how you would use roaring bitmaps. > What is helpful to me in thinking about something will naturally often not > be helpful to you or others but nothing you wrote makes me feel my first > take was in any serious way wrong. It still makes sense to me. > > And FYI, the largest integer in signed 32 bits is 2_147_483_647 I know. It's been some time since I could do hexadecimal arithmetic in my head but the the limits of common data types are still burned into my brain ;-). > which is 10 digits. A Social Security number look like xxx-xx-xxxx at > this time which is only 9 digits. These are US social security numbers. Other countries have other schemes. For example, Austrian SSNs have 10 digits, so you would need 34 bits to represent them exactly. However, they (obviously) contain some redundancy (one of the digits is a checksum, and there aren't 999999 days in a century) so it could algorithmically be compressed down to 26 bits. But you probably wouldn't do that because in almost any real application you wouldn't use the SSN as a primary key (some people don't have one, and there have been mixups resulting in two people getting the same SSN). > As for my other EXAMPLE, I fail to see why I need to provide a specific need > for an application. I don't care what they need it for. The thought was > about whether something that does not start as an integer can be uniquely > mapped into and out of integers of size 32 bits. That's what confused me. You seemed to concentrate on the "map things to integers" part which has been solved for decades and is absolutely incidental to roaring bitmaps and completely ignored what you would be using them for. So I thought I was missing something, but it seems I wasn't. hp -- _ | Peter J. Holzer | Story must make more sense than reality. |_|_) | | | | | h...@hjp.at | -- Charles Stross, "Creative writing __/ | http://www.hjp.at/ | challenge!"
signature.asc
Description: PGP signature
-- https://mail.python.org/mailman/listinfo/python-list