>
> optimize this in the future. :) I saw something about small strings here
> recently, and how this might decrease storage requirements by 80%. Is
> that just for strings, or across the entire database? For reference, the
> uncompressed OSM data is only 5G so a 6X increase worries me.
>

I answered this in more detail in my reply to Peter's email, but yes, we are
aware of this issue and some work has already been done to address it. See
that email for details.

Regarding the performance of the import, Peter's work on BDB is to see if we
can replace Lucene for the osm-id lookup, because that is what is hurting the
import of large OSM files. An OSM file contains all nodes, followed by all
ways with references back to their nodes through the osm-id, so we need a
fast lookup of the osm-id when connecting the nodes to the ways, and this is
currently done with Lucene. That works well for small to medium-sized files,
but for very large ones Lucene's performance degrades a lot. Besides Peter's
work on BDB, I am also working on supporting changesets in the graph, and if
there are strong correlations between node-changeset and way-changeset, then
this might be a faster lookup than the Lucene or BDB index, at least for
large OSM files. Basically we would use Lucene to look up the changeset node
by osm-id (a much smaller index than the one for the nodes' osm-ids), and
traverse from there to the nodes. If ways have on average 10 nodes, and we
have 100% correlation, we would get up to 10x faster lookups (minus the
traversal step and other possible overhead). A rough sketch of the idea
follows.
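To make that concrete, here is a minimal sketch in Java against the Neo4j
embedded API. Note that the index name "changesets", the key and property
name "node_osm_id", and the relationship type CHANGESET are my assumptions
for illustration only; this feature is not pushed yet, so the real names may
well differ.

    import org.neo4j.graphdb.Direction;
    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Relationship;
    import org.neo4j.graphdb.RelationshipType;
    import org.neo4j.graphdb.index.Index;

    public class ChangesetLookup {
        // Hypothetical relationship type linking each osm-node to its changeset node
        private enum OSMRel implements RelationshipType { CHANGESET }

        // One hit on the (much smaller) changeset index, then a short
        // traversal to the member nodes, matching on the osm-id property.
        public static Node findNodeViaChangeset(GraphDatabaseService db, long osmId) {
            Index<Node> changesets = db.index().forNodes("changesets"); // assumed index name
            Node changeset = changesets.get("node_osm_id", osmId).getSingle(); // assumed key
            if (changeset == null) return null;
            for (Relationship rel : changeset.getRelationships(OSMRel.CHANGESET, Direction.INCOMING)) {
                Node member = rel.getStartNode();
                if (osmId == (Long) member.getProperty("node_osm_id", -1L)) { // assumed property
                    return member;
                }
            }
            return null;
        }
    }

The 10x estimate comes from amortizing that single index hit across all the
(on average ~10) nodes of a way, when the correlation holds.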

> So at the above link, you see a list of nearby points, along with a
> direction and distance. The same information is also exported via web
> services to mobile apps in a scrollable list. What I'd hoped to do was
> to create an infinitely-scrolling list as commonly found in other
> Android apps. Unfortunately, it's hard to do that by simply bumping out
> the distance, since you can't necessarily know whether adding X to the
> bounding box dimensions might give you no points or another thousand. :)
> Since MongoDB can do this with a rather limited geospatial
> implementation, I was hoping that Neo4J or something tuned for spatial
> queries would *really* rock at it.
>

Internally, Neo4j Spatial works with bounding boxes too, and I have to assume
MongoDB does as well. Our RTree index is optimized for general spatial
objects, and while it can do distance queries, I think there are better ways.
I'm thinking that a fixed-size grid index (b-tree-like), similar to my own
'amanzi-index', would work very well for distance searches, because instead
of traversing from the top of the tree, you would traverse from the central
point outwards radially. This would be one way to have the graph really
benefit the distance search. We have considered plugging other indices into
Neo4j Spatial, but have not actually done it (yet). A toy sketch of the
radial idea is below.
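As a standalone toy (this is not the Neo4j Spatial API): points are bucketed
into fixed-size grid cells, and a query walks the cells in expanding square
rings around the centre cell, so nearby results are found without ever
touching distant parts of the index.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class GridRadialSearch {
        private final double cellSize;
        // Grid cells keyed by "cx:cy", each holding [x, y] points
        private final Map<String, List<double[]>> cells = new HashMap<String, List<double[]>>();

        public GridRadialSearch(double cellSize) { this.cellSize = cellSize; }

        public void add(double x, double y) {
            String key = cellKey((long) Math.floor(x / cellSize), (long) Math.floor(y / cellSize));
            List<double[]> cell = cells.get(key);
            if (cell == null) cells.put(key, cell = new ArrayList<double[]>());
            cell.add(new double[] { x, y });
        }

        // Visit cells in expanding rings around (x, y) until enough points are
        // collected. A strict version would scan one extra ring, since a nearer
        // point can still sit in a corner cell of the next ring.
        public List<double[]> nearest(final double x, final double y, int count) {
            long cx = (long) Math.floor(x / cellSize);
            long cy = (long) Math.floor(y / cellSize);
            List<double[]> hits = new ArrayList<double[]>();
            for (int ring = 0; hits.size() < count && ring < 1000; ring++) {
                for (long i = cx - ring; i <= cx + ring; i++) {
                    for (long j = cy - ring; j <= cy + ring; j++) {
                        // Only the outer ring of cells is new at this radius
                        if (Math.max(Math.abs(i - cx), Math.abs(j - cy)) != ring) continue;
                        List<double[]> cell = cells.get(cellKey(i, j));
                        if (cell != null) hits.addAll(cell);
                    }
                }
            }
            // Order the collected candidates by true distance to the query point
            Collections.sort(hits, new Comparator<double[]>() {
                public int compare(double[] a, double[] b) {
                    double da = (a[0] - x) * (a[0] - x) + (a[1] - y) * (a[1] - y);
                    double db = (b[0] - x) * (b[0] - x) + (b[1] - y) * (b[1] - y);
                    return Double.compare(da, db);
                }
            });
            return hits.subList(0, Math.min(count, hits.size()));
        }

        private static String cellKey(long i, long j) { return i + ":" + j; }
    }

In a graph database the cells themselves could be nodes connected to their
neighbours, so this outward walk becomes a plain traversal starting from the
cell containing the query point.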

It is still worth trying out the current index; it might be fast enough,
perhaps not for truly infinite scrolling lists, but at least for the first
many pages (so effectively infinite from the user's point of view). One way
to page like that is sketched below.
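For the paging itself, a simple trick is to grow the search window
geometrically and de-duplicate, so a page is never empty no matter how
unevenly the points are distributed. Here the search(...) interface is just a
placeholder for whichever spatial index you query:

    import java.util.ArrayList;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    public class ExpandingWindowPager {
        // Placeholder for a bounding-box query against whatever index is in use
        public interface BBoxSearch {
            List<Long> search(double minX, double minY, double maxX, double maxY);
        }

        // Double the window around (x, y) until at least `wanted` distinct ids
        // are collected, so the caller never gets an empty "next page".
        public static List<Long> fetch(BBoxSearch index, double x, double y,
                                       double startHalfWidth, int wanted) {
            Set<Long> seen = new LinkedHashSet<Long>();
            double half = startHalfWidth;
            while (seen.size() < wanted && half < 180.0) { // stop at a sane maximum
                seen.addAll(index.search(x - half, y - half, x + half, y + half));
                half *= 2;
            }
            return new ArrayList<Long>(seen);
        }
    }

Doubling rather than adding a fixed amount sidesteps the problem you
mention, that one step outwards can yield anything from no new points to
another thousand.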

>
> http://www.slideshare.net/craigtaverner/neo4j-spatial-backing-a-gis-with-a-true-graph-database
>
> I'd appreciate a description of those slides if you or someone else has
> the time, as I can't see them. :) In any case, I'll try playing with the
> database myself once this import finishes and see what I can learn that
> way.
>

Why can you not see them? Site blocked? Shall I email you the original
presentation?

> > If you want to do this in Java, take a look at the sample code at
> > https://github.com/neo4j/neo4j-spatial/blob/master/src/test/java/org/neo4j/gis/spatial/TestDynamicLayers.java
>
> I'm doing this in Scala, so thanks for pointing me to the Java example.
> I've been seeing lots of old testcases and such so it was tough to find
> ones that were still valid.
>

The directory holding that test class contains many others, and all of them
are run every night, so they should all work. Neo4j Spatial is sadly lacking
in good documentation, so these test classes are the best place to
investigate how to use it.

> So I wrote a simple Scala script that is currently importing OSM data,
> and this raises a few questions. The README shows a shapefile being
> imported without a BatchInserter. Is there any way to do this with OSM
> as well? One feature I'd like to provide at some point are automatic map
> updates, so each week an OSM changeset would be fetched and merged in.
> It would be great if I could do that merge without having to shut down
> the live database, and to handle that changeset merge as a single
> transaction. I gather changeset imports aren't yet possible, but is
> there any way to forgo the BatchInserter? Even if the process is slower,
> it wouldn't necessarily have to complete quickly, and if it chugged away
> for a few days in the background then I'd be fine with that.
>

The readme is rather out of date, written before we supported OSM import. As
described in my previous email, there are several easy ways to import OSM,
and since you got it working in Scala, I guess you figured it out OK.
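For reference, the usual Java pattern, modelled on how the test classes drive
the importer (check the current method signatures in the repository, since
they may have changed):

    import org.neo4j.gis.spatial.osm.OSMImporter;
    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.kernel.EmbeddedGraphDatabase;
    import org.neo4j.kernel.impl.batchinsert.BatchInserter;
    import org.neo4j.kernel.impl.batchinsert.BatchInserterImpl;

    public class ImportOSM {
        public static void main(String[] args) throws Exception {
            // First pass: stream the XML into the store with the batch inserter
            BatchInserter inserter = new BatchInserterImpl("target/osm-db");
            OSMImporter importer = new OSMImporter("map.osm");
            importer.importFile(inserter, "map.osm");
            inserter.shutdown();

            // Second pass: open a normal database and build the indexes
            GraphDatabaseService db = new EmbeddedGraphDatabase("target/osm-db");
            importer.reIndex(db, 1000); // commit every 1000 entities
            db.shutdown();
        }
    }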

The question of the batch inserter was also discussed in Peter's reply, and
in my reply to his. So yes, we have new code that imports with the normal
transactional API. I also mentioned the new support for changesets (only on
my laptop, not pushed yet). This does not support the full changeset merge
you are asking for, but it is a step towards that.

> Also, is there any means of determining how many OSM entities have been
> imported? I have an Import class that tries to track this for display
> purposes. My first attempt overrode the various create methods on
> BatchInserter to increment a counter, but that's currently at 90 million
> in a file that should only have 20 million or so entities. :)
>

The graph created has a tree structure rooted at the database root node. You
can traverse down to a collection of nodes and a collection of ways. The Ruby
wrapper actually exposes node and way counts through this, so you could look
at that code to see how it is done; a rough Java equivalent is sketched
below.
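Something like the following. The relationship type names (OSM, WAYS, NEXT)
and the chain structure are my assumptions for illustration, so check the OSM
model in OSMImporter for the real ones; the Ruby wrapper may also just read a
stored count property rather than walking the chain.

    import org.neo4j.graphdb.Direction;
    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.Relationship;
    import org.neo4j.graphdb.RelationshipType;

    public class OSMCounts {
        // Assumed relationship types; verify against the OSM model in OSMImporter
        private enum OSMRel implements RelationshipType { OSM, WAYS, NEXT }

        // Walk from the reference node down to the ways collection and count it
        public static long countWays(GraphDatabaseService db) {
            Node root = db.getReferenceNode();
            Node dataset = root.getSingleRelationship(OSMRel.OSM, Direction.OUTGOING).getEndNode();
            Node current = dataset.getSingleRelationship(OSMRel.WAYS, Direction.OUTGOING).getEndNode();
            long count = 0;
            Relationship next;
            // Ways are chained with NEXT relationships in this assumed model
            while ((next = current.getSingleRelationship(OSMRel.NEXT, Direction.OUTGOING)) != null) {
                current = next.getEndNode();
                count++;
            }
            return count;
        }
    }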

The reason the database node count is so high is that, in order to support
the fact that nodes can be shared by many ways (at junctions and
intersections, and where edges touch), we needed to create two database nodes
for every osm-node. So the total number of database nodes is usually twice
the number of osm-nodes, plus three times the number of way nodes, plus some
more for relations, changesets and users. Since the osm-nodes are the most
numerous, this 2x factor on nodes is the main issue. It also has a knock-on
effect on relationships, since each osm-node results in at least three
relationships in the graph, so the relationship count is high.
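As a rough worked example (the split between nodes and ways here is just an
assumption for illustration): a file with 18 million osm-nodes and about 2
million ways gives 2 x 18M = 36M database nodes from the osm-nodes alone, and
at least 3 x 18M = 54M relationships connecting them. If your counter was
incrementing on createRelationship as well as on the node-creating methods,
then 36M + 54M = 90M lines up quite well with the 90 million you saw for a
file of roughly 20 million OSM entities.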

While it is possible to cut this down, it would make the graph more
complicated to traverse, so we would increase the complexity of the code and
possibly reduce performance. We will only consider that route if we really
need to.

> Finally, I'm seeing lots of geospatial queries and have a basic
> understanding of how to do those, but could you point me to sample Java
> code for pulling an entity with a specific ID out of the database? Since
> my current code is very minimally geospatial, it relies heavily on doing
> bounding box searches and then pulling entities out based on their ID.
> IOW, how would I query the graph database to retrieve an OSM node with
> the ID 123456789? I'm guessing there isn't a direct correspondence
> between OSM and Neo4J IDs, since that'd likely lead to collision.
>

As discussed above, we use a Lucene index to track the OSM node ids (what I
referred to as the osm-id above). Obviously the Neo4j id is not the same
number, as you suspected, so we needed this index. This is also the
bottleneck for loading large OSM files.

For sample code, look at the getNodes method on line 500 of OSMImporter:
https://github.com/neo4j/neo4j-spatial/blob/master/src/main/java/org/neo4j/gis/spatial/osm/OSMImporter.java
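And as a rough sketch of what such a lookup looks like with the standard
Neo4j index API (the index name "nodes", the key "node_osm_id" and the
lat/lon property names are assumptions on my part; getNodes shows the real
ones):

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Node;
    import org.neo4j.graphdb.index.Index;
    import org.neo4j.kernel.EmbeddedGraphDatabase;

    public class OSMNodeLookup {
        // Fetch the graph node for a given osm-id via the Lucene-backed index
        public static Node byOsmId(GraphDatabaseService db, long osmId) {
            Index<Node> index = db.index().forNodes("nodes");   // assumed index name
            return index.get("node_osm_id", osmId).getSingle(); // assumed key name
        }

        public static void main(String[] args) {
            GraphDatabaseService db = new EmbeddedGraphDatabase("target/osm-db");
            try {
                Node node = byOsmId(db, 123456789L);
                System.out.println(node == null ? "not found"
                        : "lat=" + node.getProperty("lat")      // assumed property
                        + ", lon=" + node.getProperty("lon"));  // names
            } finally {
                db.shutdown();
            }
        }
    }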

Regards, Craig