I'm writing a script to extract bits out of the OSM full-planet file. The full-planet file differs from the regular planet file in several ways. One of the biggest is that it contains every version of every object, ever. I want my script to grab just the latest version of each object up to a specified date (it also does extracts based on a bbox). It's not hard to do, but I want it to be as fast as possible.
Right now I am making two passes through the file. On the first pass, I build sets containing the unique ID of each feature I want to keep. The second pass outputs what was selected. Two passes are necessary because nodes are listed before ways, and I want to be able to grab every node in a way when the way is clipped by the bbox.

The set works great, but now I want to save both the ID and the proper version of each object to be kept. Sets are lightning fast, especially for simple membership testing across large sets. But now I have two values I want to cram into the set.

One way I thought of is to "hash" the version into the ID. I could append it to the end:

ID=12345 ver=6
hash = 123456

Another is to make the version a decimal:

hash = 12345.6

The least "cute" way to handle it would be to create pairs:

hash = (12345, 6)

I'm not a big fan of "cute tricks" in code. But I also want this to be fast. This is running over a file that's quickly approaching 500GB. The first option would seem to be the fastest. Of course, the fastest of all would be to just chop ways at the bounding box and only run through the file once...

Anyone have an opinion here?

-Eric
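
P.S. To make the three options concrete, here is a rough Python sketch of each encoding. The function names are made up for illustration and this is not the actual script. It also shows one thing worth weighing: as written, the first two encodings can map different (ID, version) combinations to the same value, while the pair cannot.

```python
# Illustrative only: three ways to fold (id, version) into one set member.
# All function names are hypothetical.

def key_concat(obj_id: int, version: int) -> int:
    """Option 1: append the version's digits to the end of the ID."""
    return int(f"{obj_id}{version}")

def key_decimal(obj_id: int, version: int) -> float:
    """Option 2: use the version as the fractional part."""
    return float(f"{obj_id}.{version}")

def key_pair(obj_id: int, version: int) -> tuple:
    """Option 3: store a plain (id, version) tuple."""
    return (obj_id, version)

# Membership testing works the same way for all three; shown with pairs.
keep = {key_pair(12345, 6)}
print((12345, 6) in keep)
print((1234, 56) in keep)

# Option 1 is ambiguous without a fixed-width version field:
# ID 12345 v6 and ID 1234 v56 produce the same integer.
print(key_concat(12345, 6), key_concat(1234, 56))

# Option 2 is ambiguous too: version 6 and version 60 give the same float.
print(key_decimal(12345, 6) == key_decimal(12345, 60))
```

If the first option is still attractive for speed, one way around the ambiguity (an assumption on my part, not something I have benchmarked) would be to reserve a fixed number of digits or bits for the version so the two fields can never overlap.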
