> > optimize this in the future. :)
>
> I saw something about small strings here recently, and how this might
> decrease storage requirements by 80%. Is that just for strings, or
> across the entire database? For reference, the uncompressed OSM data is
> only 5G, so a 6X increase worries me.

I answered this in more detail in reply to Peter's email, but yes, we are
aware of this issue and some work has already been done to address it.
Read the other email for details.

Regarding the performance of the import, Peter's work on BDB is to see if
we can replace Lucene for the osm-id lookup, because that is what is
hurting the import of large OSM files. The OSM file contains all nodes,
followed by all ways with references back to their nodes through the
osm-id, so we need a fast lookup of the osm-id when connecting the nodes
to the ways, and this is done with Lucene. It works well for small to
medium size files, but for very large ones, the Lucene performance
degrades a lot.

Other than Peter's work on BDB, I am also working on supporting
changesets in the graph, and if there are strong correlations between
node-changeset and way-changeset, then this might be a faster lookup than
the Lucene or BDB index, at least for large OSM files. Basically we would
use Lucene to look up the changeset node by osm-id (a much smaller index
than that for the nodes' osm-ids), and traverse from there to the nodes.
If ways have on average 10 nodes, and we have 100% correlation, we would
get up to 10x faster lookups (minus the traversal step, and other
possible overhead).
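To give a feel for what that osm-id lookup involves, here is a minimal
sketch using the standard Neo4j index API. The index name and key
("nodes", "node_osm_id"), and the id being stored as a long, are my
assumptions for illustration only; the authoritative code is the
getNodes method in OSMImporter, referenced at the end of this mail.

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.index.Index;
    import org.neo4j.kernel.EmbeddedGraphDatabase;

    public class OsmIdLookup {
        public static void main(String[] args) {
            GraphDatabaseService db = new EmbeddedGraphDatabase("target/osm-db");
            try {
                // Index name, key and value type are assumptions --
                // see OSMImporter.getNodes for what is actually used
                Index<Node> index = db.index().forNodes("nodes");
                Node node = index.get("node_osm_id", 123456789L).getSingle();
                if (node != null) {
                    System.out.println("Found osm-node 123456789");
                }
            } finally {
                db.shutdown();
            }
        }
    }

Replacing that single index call with a BDB lookup is essentially what
Peter is experimenting with.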
> So at the above link, you see a list of nearby points, along with a
> direction and distance. The same information is also exported via web
> services to mobile apps in a scrollable list. What I'd hoped to do was
> to create an infinitely-scrolling list as commonly found in other
> Android apps. Unfortunately, it's hard to do that by simply bumping out
> the distance, since you can't necessarily know whether adding X to the
> bounding box dimensions might give you no points or another thousand. :)
> Since MongoDB can do this with a rather limited geospatial
> implementation, I was hoping that Neo4J or something tuned for spatial
> queries would *really* rock at it.

Internally Neo4j Spatial is working with bounding boxes too. I have to
assume MongoDB does also. Our RTree index is optimized for general
spatial objects, and while it can do distance queries, I think there are
better ways. I'm thinking that a fixed-size grid, b-tree-like index,
similar to my own 'amanzi-index', would work very well for distance
searches, because instead of traversing from the top of the tree, you
would traverse from the central point outwards, radially. This would be
one way to have the graph really benefit the distance search. We have
considered plugging other indices into Neo4j Spatial, but have not
actually done it (yet).

It is still worth trying out the current index; it might be fast enough,
not for infinite scrolling lists, but at least for the first many pages
(so effectively infinite from the user's point of view).
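A window query against the current index looks something like this rough
sketch, with class names as used in the test code (the layer name and the
coordinates are my assumptions; check the tests if the signatures have
moved):

    import java.util.List;

    import com.vividsolutions.jts.geom.Envelope;
    import org.neo4j.gis.spatial.Layer;
    import org.neo4j.gis.spatial.SpatialDatabaseRecord;
    import org.neo4j.gis.spatial.SpatialDatabaseService;
    import org.neo4j.gis.spatial.query.SearchIntersectWindow;
    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.kernel.EmbeddedGraphDatabase;

    public class NearbySearch {
        public static void main(String[] args) {
            GraphDatabaseService db = new EmbeddedGraphDatabase("target/osm-db");
            try {
                SpatialDatabaseService spatial = new SpatialDatabaseService(db);
                // Layer name assumed to match the OSM dataset name
                Layer layer = spatial.getLayer("map.osm");
                // JTS Envelope takes (xmin, xmax, ymin, ymax)
                SearchIntersectWindow search = new SearchIntersectWindow(
                        new Envelope(13.0, 13.1, 55.5, 55.6));
                layer.getIndex().executeSearch(search);
                List<SpatialDatabaseRecord> results = search.getResults();
                System.out.println("Found " + results.size() + " geometries");
            } finally {
                db.shutdown();
            }
        }
    }

For the scrolling list you could grow the envelope between pages and
de-duplicate, sorting by distance from the centre yourself, since the
window query does not order its results by distance.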
> > http://www.slideshare.net/craigtaverner/neo4j-spatial-backing-a-gis-with-a-true-graph-database
>
> I'd appreciate a description of those slides if you or someone else has
> the time, as I can't see them. :) In any case, I'll try playing with the
> database myself once this import finishes and see what I can learn that
> way.

Why can you not see them? Site blocked? Shall I email you the original
presentation?

> > If you want to do this in Java, take a look at the sample code at
> >
> > https://github.com/neo4j/neo4j-spatial/blob/master/src/test/java/org/neo4j/gis/spatial/TestDynamicLayers.java
>
> I'm doing this in Scala, so thanks for pointing me to the Java example.
> I've been seeing lots of old testcases and such so it was tough to find
> ones that were still valid.

The directory containing that test class contains many others, and all of
them are run every night, so they should all work. Neo4j Spatial is sadly
lacking in good documentation, so these test classes are the best place
to investigate how to use it.

> So I wrote a simple Scala script that is currently importing OSM data,
> and this raises a few questions. The README shows a shapefile being
> imported without a BatchInserter. Is there any way to do this with OSM
> as well? One feature I'd like to provide at some point are automatic map
> updates, so each week an OSM changeset would be fetched and merged in.
> It would be great if I could do that merge without having to shut down
> the live database, and to handle that changeset merge as a single
> transaction. I gather changeset imports aren't yet possible, but is
> there any way to forgo the BatchInserter? Even if the process is slower,
> it wouldn't necessarily have to complete quickly, and if it chugged away
> for a few days in the background then I'd be fine with that.

The README is rather out of date, written before we supported OSM import.
As described in my previous email, there are many easy ways to import
OSM, and since you got it working in Scala, I guess you figured it out
OK. The question of the batch inserter was also discussed in Peter's
reply, and my reply to his reply. So, yes, we have new code that imports
with the normal API. I also mentioned the new support for changesets
(only on my laptop, not pushed yet). This does not support the full
changeset merge you are asking for, but is a step towards that.
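For reference, the batch inserter route looks roughly like this (a
sketch from memory of the test code; the paths, the memory setting and
the final boolean on importFile are illustrative, so check OSMImporter
and its tests for the exact overloads):

    import java.util.HashMap;
    import java.util.Map;

    import org.neo4j.gis.spatial.osm.OSMImporter;
    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.kernel.EmbeddedGraphDatabase;
    import org.neo4j.kernel.impl.batchinsert.BatchInserter;
    import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;

    public class ImportOSM {
        public static void main(String[] args) throws Exception {
            OSMImporter importer = new OSMImporter("map.osm");

            // Stage 1: stream the XML into the store with the batch inserter
            Map<String, String> config = new HashMap<String, String>();
            config.put("neostore.nodestore.db.mapped_memory", "90M");
            BatchInserter inserter = new BatchInserterImpl("target/osm-db", config);
            importer.importFile(inserter, "map.osm", false);
            inserter.shutdown();

            // Stage 2: re-connect with the normal transactional API and
            // index the imported data, committing every 10000 entities
            GraphDatabaseService db = new EmbeddedGraphDatabase("target/osm-db");
            importer.reIndex(db, 10000);
            db.shutdown();
        }
    }

Note that the reIndex stage is already transactional; the new code I
mentioned does the first stage with the normal API as well.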
> Also, is there any means of determining how many OSM entities have been
> imported? I have an Import class that tries to track this for display
> purposes. My first attempt overrode the various create methods on
> BatchInserter to increment a counter, but that's currently at 90 million
> in a file that should only have 20 million or so entities. :)

The graph created has a tree structure rooted at the database root node.
You can traverse down to a collection of nodes and a collection of ways.
The Ruby wrapper actually exposes node and way counts through this, so
you could look at that code to see how it is done.

The reason the database node count is so high is that in order to support
the fact that nodes can be shared by many ways (at junctions and
intersections, and touching edges), we needed to create two database
nodes for every osm-node. So the total number of database nodes is
usually twice the number of osm-nodes, plus three times the number of
ways, plus some more for relations, changesets and users. Since the
nodes are most numerous, this 2x nodes is the main issue. This has a
knock-on effect on relationships, since each osm-node results in at least
three relationships in the graph, so the relationship count is high.
While it is possible to cut this down, it would make the graph more
complicated to traverse, so we would increase the complexity of the code
and possibly reduce performance. We will only consider that route if we
really need to.

> Finally, I'm seeing lots of geospatial queries and have a basic
> understanding of how to do those, but could you point me to sample Java
> code for pulling an entity with a specific ID out of the database? Since
> my current code is very minimally geospatial, it relies heavily on doing
> bounding box searches and then pulling entities out based on their ID.
> IOW, how would I query the graph database to retrieve an OSM node with
> the ID 123456789? I'm guessing there isn't a direct correspondence
> between OSM and Neo4J IDs, since that'd likely lead to collision.

As discussed above, we use a Lucene index to track the OSM node ids (I
usually referred to them as the osm-id above). Obviously the Neo4j id is
not the same number, as you suspected, so we needed this index. This is
also the bottleneck for loading large OSM files. For sample code, look at
the getNodes method on line 500 of OSMImporter
(https://github.com/neo4j/neo4j-spatial/blob/master/src/main/java/org/neo4j/gis/spatial/osm/OSMImporter.java).
The lookup sketch near the top of this mail shows the same idea in
miniature.

Regards, Craig

_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user