Hi, I'm looking to match about 700 features, so 2,000 should be enough to account for the variable tagging. I guess it could be run again, for 5,000 primary key/value pairs to get a better sample of the planets feature tagging system.
cheers, sam On 10/24/10, Jochen Topf <joc...@remote.org> wrote: > On Thu, Oct 21, 2010 at 06:07:33PM -0500, Scott Crosby wrote: >> On Wed, Oct 13, 2010 at 2:38 AM, Jochen Topf <joc...@remote.org> wrote: >> >> > On Tue, Oct 12, 2010 at 10:58:34PM +0200, Sebastian Klein wrote: >> > > Jochen Topf wrote: >> > >> On Tue, Oct 05, 2010 at 04:46:26PM +0200, Pieren wrote: >> > >>> If I could have one request, it would be nice to see the amount of >> > different >> > >>> contributors using the same tag. This to distinguish between >> > >>> quantity >> > and >> > >>> popularity. I know it might be challenging since we should only >> > >>> count >> > the >> > >>> user of the tag creation in the element history... >> > >> >> > >> On http://taginfo.openstreetmap.de/keys there is a 'users' column. >> > >> But >> > this >> > >> doesn't look at the history, only the current use. It gives you still >> > some >> > >> idea, but its not perfect. But reading the history is not an option >> > >> at >> > the >> > >> moment, because this would need far more resources. >> > >> >> > >> The number of users is also taking into account when creating the tag >> > cloud >> > >> for the home page. This way some tags from imports which are very >> > >> common >> > in >> > >> the database but have a small user count are downgraded. :-) >> > >> >> > >> Jochen >> > > >> > > Is it planned to have users count for the individual key pages? It can >> > > be interesting to see how popular common_key=some_exotic_value really >> > > is. >> > > Sometimes it is used frequently, but by a single mapper only. >> > >> > Users are counted for keys only and not for key=value combinations, >> > because >> > there are just too many key=value combinations and too many users to do >> > this >> > counting efficiently. At least I haven't come up with an idea how to do >> > this. >> > maybe somebody else can. >> > >> > Currently for every key I create a hash with all users in it, that use >> > this >> > key. When I am through all the tags, I count how many elements there are >> > in >> > each hash and thats the number of users for this key. This is rather >> > inefficient and could probably be improved using some clever hashing for >> > the price of some inaccuracies (which don't matter too much in this >> > case, >> > all we really want to know is roughly how many users there are). >> >> >> Very nice work you have and a very neat tool you wrote. >> >> I have an idea that might help with these counts. Bloom filters. For >> instance, a few months ago, I did a test where I counted the frequency of >> different various values for each key across a whole planet in only a few >> tens of megabytes of RAM. I used a hash table for counting, but when there >> were too many distinct values for a key (>1000), I switched to a bloom >> filter of 1000000 bits and could only report the number of distinct values >> for that key. If there were >500000 distinct values, I just reported >> ">500000". Something similar might work in other areas of your design. I >> think it could solve this: >> >> > But even when this is done in a more efficient manner, we can't to that >> > for 50 million different key=value combinations. We might be able to do >> > it for the more popular combinations, after all if a key=value >> > combination >> > only appears twice in the whole database, it doesn't really matter if >> > that >> > was from one or two users. >> > >> >> in about 8gb of RAM. Have a smallish bloom filter, say, 64 bits for each >> of >> the 50m distinct tags. Store the user that uses that tag in the bloom >> filter. You'll need 50m of them (only 400mb), plus the hash table from tag >> to bloom. Then, figure out which bloom filters have only one bit set. >> Those >> tag pairs have, with high probability, only been written by one user. In a >> second pass, figure out which user was responsible for causing that bit to >> be set. The hash table from tags to bloom filters will require more memory >> than the filters themselves, but you could cheat. Have an array of 512 >> million 64-bit bloom filters. Use hashing to map from a tag to the index >> of >> the bloom filter in that array and ignore the chances of collisions. To >> detect accidental collisions between where two tags map to the same bloom >> filter, or two users map to the same index in the bloom filter, use a >> keyed >> hash function and repeat the above a 3-5 times with different keys. Use >> the >> minimum value across the different passes. > > I have been very reluctant to do anything that gives me results that are not > 100% accurate. But I see that this might be too much to ask for with the > kind > of ressources we have. > > There are some interesting ideas here. In this case the bloom filter could > give > me exactly the needed answer, namely whether there are lots of users using a > tag or only a few. I already have the hash table from tag to some structure > that I use to count several things, so I only need the bloom filter. But the > bloom filter would have to be bigger than 64 bits, wouldn't it? User IDs are > in the range from 1 to about 360000 currently. If I want to know with a > certain probability whether a tag has less than or more than n users, how > can I figure out how big the bloom filter has to be? > > Jochen > -- > Jochen Topf joc...@remote.org http://www.remote.org/jochen/ > +49-721-388298 > > > _______________________________________________ > talk mailing list > talk@openstreetmap.org > http://lists.openstreetmap.org/listinfo/talk > -- Twitter: @Acrosscanada Blogs: http://acrosscanadatrails.posterous.com/ http://Acrosscanadatrails.blogspot.com Facebook: http://www.facebook.com/sam.vekemans Skype: samvekemans IRC: irc://irc.oftc.net #osm-ca Canadian OSM channel (an open chat room) @Acrosscanadatrails _______________________________________________ talk mailing list talk@openstreetmap.org http://lists.openstreetmap.org/listinfo/talk