On Thu, Apr 28, 2011 at 9:02 AM, Eric Wolf <[email protected]> wrote:
> I'm writing a script to extract bits out of the OSM full-planet file.
> The full-planet differs from the regular planet file in several ways.
> One of the biggest is that it contains every version of every object
> ever. I want my script to be able to grab just the latest version up
> to a specified date (also does extracts based on bbox). It's not hard
> to do but I want it to be as fast as possible.
>
> Right now I am making two passes through the file. The first pass, I
> build sets containing the unique ID for each feature I want to keep.
> The second pass outputs what's been selected to be kept. Two passes
> are necessary because nodes are listed before ways. I want to be able
> to grab every node in a way when the way is clipped by the bbox.
>
> The set works great but now I want to save the ID and proper version
> for each object to be kept. Sets are lightning fast, especially for
> simple membership testing across large sets. But now I have two values
> I want to cram into the set.
>
> One way I thought of is to "hash" the version into the ID. I could
> either append it to the end:
>
> ID=12345
> ver=6
>
> hash = 123456
>
> Another is to make the version a decimal:
>
> hash = 12345.6
>
> The least "cute" way to handle it would be to create pairs:
>
> hash = (12345, 6)
>
> I'm not a big fan of "cute tricks" in code. But I also want this to be
> fast. This is running over a file that's quickly approaching 500GB.
> The first option would seem to be fastest.
>
> Of course, the fastest of all would be to just chop ways at the
> bounding box and only run through the file once...
>
> Anyone have an opinion here?
>
> -Eric

I suspect any trick that doesn't yield integer keys won't buy you much.
Would a BTree, using id as key and version as value, help you? The ZODB
package includes very fast and efficient trees and sets and doesn't
require Zope: http://pypi.python.org/pypi/ZODB3/3.10.3

You might profit from taking this question to Stack Overflow.

--
Sean
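
For what it's worth, here is a minimal sketch of the options discussed above. It is an illustration only, not Eric's script. It uses a plain dict where a BTree could stand in for the id -> version mapping. The names pack, unpack, record and VERSION_BITS are made up for this example, and the 16-bit limit on versions is an assumption that would need checking against the real data. The two asserts in the middle show why appending the version as decimal digits, or as a decimal fraction, is ambiguous.

```python
VERSION_BITS = 16  # assumption: every version number fits in 16 bits


def pack(obj_id, version):
    """Combine id and version into one integer key (hypothetical helper)."""
    if not 0 <= version < (1 << VERSION_BITS):
        raise ValueError("version does not fit in VERSION_BITS")
    return (obj_id << VERSION_BITS) | version


def unpack(key):
    """Recover (id, version) from a key produced by pack()."""
    return key >> VERSION_BITS, key & ((1 << VERSION_BITS) - 1)


def record(latest, obj_id, version):
    """First pass: remember the highest version seen for each id."""
    if version > latest.get(obj_id, 0):
        latest[obj_id] = version


# Decimal tricks collide: id 12345 v6 and id 1234 v56 give the same key,
# and version 6 is indistinguishable from version 60 as a fraction.
assert int("12345" + "6") == int("1234" + "56")
assert float("12345.6") == float("12345.60")

# id -> version mapping, as in the BTree suggestion (plain dict here).
latest = {}
for obj_id, version in [(12345, 5), (12345, 6), (12345, 4)]:
    record(latest, obj_id, version)

# Integer-keyed set for fast membership tests in the second pass.
keep = {pack(i, v) for i, v in latest.items()}
```

In the second pass, a test such as pack(obj_id, version) in keep (or latest.get(obj_id) == version) selects the wanted version. A set of (id, version) tuples works the same way without any packing, at some extra memory cost per entry.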
