Re: [Neo4j] Neo4j low-level data storage

2011-10-10 Thread danielb
First, thanks to all of you for the hints, and for Craig's detailed
explanation. Peter, I have not looked at any code in detail yet, but I have had
a look at the (beta) wiki of Neo4j Spatial.

Some more questions on the storage format of Neo4j: so the nodes, properties
and relationships are stored in separate files, each with its own id space. If
I load information from the database I will probably have to hit each store.
So let's say I have a large database of several gigabytes; I will end up with
three sequential reads (at best) at three different locations on the hard
disk. If I store everything in three tables in the relational approach I
will come to the same result in terms of data locality - if my assumption is
correct. So there is no way to do a bounding-box read from contiguous space
when using all of the data stores?
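To check my understanding of the separate id spaces, here is a tiny Python
sketch of what I mean (record sizes are purely illustrative assumptions, not
the real Neo4j on-disk format):

```python
# Toy model of per-store id spaces: each store hands out its own ids,
# so node 1, relationship 1 and property 1 can all exist at once.
# Record sizes are illustrative, not the real Neo4j format.

RECORD_SIZES = {"nodestore": 9, "relationshipstore": 33, "propertystore": 25}

class Store:
    def __init__(self, record_size):
        self.record_size = record_size
        self.next_id = 0

    def allocate(self):
        rid = self.next_id
        self.next_id += 1
        return rid

    def offset(self, rid):
        # Fixed-size records: an id maps directly to a file offset.
        return rid * self.record_size

stores = {name: Store(size) for name, size in RECORD_SIZES.items()}
node_id = stores["nodestore"].allocate()
rel_id = stores["relationshipstore"].allocate()
prop_id = stores["propertystore"].allocate()
```

So reading one logical entity completely still means seeking into each of the
three files separately.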
The next question concerns the way OSM ways are stored. I thought the
relationships connect the nodes to a way, like is-part-of. Now there are
proxy nodes which connect the nodes to a way. How do the relationships
come into play then? From some presentations I have seen that nodes have
lat/lon and name information. Was that only for illustration, and would that
information normally be stored in the property store? How much information
can be stored in a node at all? This would affect the way I can retrieve
data.
Let's stay with OSM: how would I store a closed way which isn't connected to
anything else? It would have no connection to the rest of the graph. How
would that fit from a design perspective?

For my master's thesis I will have to write comparable importers for Neo4j and
PostGIS. Your proposal to lower fragmentation sounds reasonable. I think I
can adapt it (or any other approach that comes to my mind) in a branch and
also add configuration options to the importer.

Regards, Daniel

--
View this message in context: 
http://neo4j-community-discussions.438527.n3.nabble.com/Neo4j-low-level-data-storage-tp3336483p3409162.html
Sent from the Neo4j Community Discussions mailing list archive at Nabble.com.
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Neo4j low-level data storage

2011-10-07 Thread danielb
Hello Chris,

thanks for your postings; they are a great starting point for me to assess the
performance of this database. However, I have some questions left.
Let's say I have a node store and a property store, both of which are
individual files. I am going to implement a GIS application which fetches
OSM data from the hard disk. When I load a bounding box I want to avoid too
many random reads from the disk.
More precisely, I want to load nodes and properties inside a given bounding
box. It would be great if both the nodes and the properties were organized in
successive blocks. Is there one id pool for both nodes and properties, so
that I can load, for example, the nodes with ids 1 and 2 and the properties 3,
4 and 5 with one block read? I may be totally wrong, because if I save a new
node file with ids 1 and 2 and then save a new property file with id 3, the
property file will start on a new block (with a Windows block size of, say,
4K). When I then write a new node id, it would be saved in the first block, I
guess. What about fragmentation? And is there an improvement when using a
Linux system (inodes? I don't know Linux well)? When I am finished saving the
nodes and properties, is there some way of reorganizing them on the hard disk?
Let's say I want to insert a new node which is connected to a low id. Will it
get the first free id (and perhaps be saved at the other end of the hard
disk), or does it get an already-used id, with the following records
reorganized (insert performance)?
Maybe I am totally wrong about this, but I would appreciate an efficient way
of storing GIS data.
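To illustrate what I mean by block reads, here is a small sketch (the fixed
record size and the 4K block size are just assumptions for the example):

```python
# Sketch: which disk blocks does a set of record ids touch?
# 9-byte records and 4K blocks are assumptions for illustration only.

BLOCK_SIZE = 4096
RECORD_SIZE = 9

def blocks_touched(ids, record_size=RECORD_SIZE, block_size=BLOCK_SIZE):
    # Each id maps to offset id * record_size; count distinct blocks.
    return {(i * record_size) // block_size for i in ids}

contiguous = blocks_touched(range(100))               # ids packed together
scattered = blocks_touched([0, 500_000, 1_000_000])   # ids far apart
```

One hundred contiguous records fit in a single block, while three scattered
ids cost three separate block reads - that is the locality I am after.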

best regards, Daniel 

--
View this message in context: 
http://neo4j-community-discussions.438527.n3.nabble.com/Neo4j-low-level-data-storage-tp3336483p3402827.html
Sent from the Neo4j Community Discussions mailing list archive at Nabble.com.


Re: [Neo4j] Neo4j low-level data storage

2011-10-07 Thread Peter Neubauer
Daniel,
for OSM data and GIS, have you looked at
https://github.com/neo4j/spatial, especially the OSM examples at
https://github.com/neo4j/spatial/blob/master/src/test/java/org/neo4j/gis/spatial/pipes/GeoPipesTest.java
and 
https://github.com/neo4j/spatial/blob/master/src/test/java/org/neo4j/gis/spatial/OsmAnalysisTest.java
?

Cheers,

/peter neubauer

GTalk:      neubauer.peter
Skype       peter.neubauer
Phone       +46 704 106975
LinkedIn   http://www.linkedin.com/in/neubauer
Twitter      http://twitter.com/peterneubauer

http://www.neo4j.org               - Your high performance graph database.
http://startupbootcamp.org/    - Öresund - Innovation happens HERE.
http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.



On Fri, Oct 7, 2011 at 3:46 PM, danielb danielbercht...@gmail.com wrote:
 [quoted text of Daniel's message trimmed]



Re: [Neo4j] Neo4j low-level data storage

2011-10-07 Thread Craig Taverner
I think Daniel's questions are very relevant, and not just to OSM. Any large
graph (of which OSM is simply a good example) will be affected by
fragmentation, and that can affect performance. I was recently hit by poor
performance of GIS queries (not OSM) caused by fragmentation of the index
tree. I will describe that problem below, but first let me describe my view
on Daniel's question.

It is true that if parts of the graph that are geographically close are also
close on disk the load time for bounding box queries will be faster.
However, this is not a problem that is easy to solve in a generic way,
because it requires knowledge of the domain. I can see two ways to create a
less fragmented graph:

   - Have a de-fragmenting algorithm that re-organizes an existing graph
   according to some rules. This does not exist in neo4j (yet?), but is
   probably easier to generalize, since it should be possible to first analyse
   the connectedness of the graph, and then defragment based on that. This
   means a reasonable solution might be possible without knowing the domain.
   - Be sure to load domain-specific data in the order you wish to query it.
   In other words, create a graph that is already de-fragmented.

This second approach is the route I have started following (at least I've
taken one or two tiny baby-steps in that direction, but plan for more). In
the case of the OSM model produced by the OSMImporter in Neo4j-Spatial, we
do not do much here. We are importing the data in the order it was created
in the original postgres database (i.e. in the order it was originally added
to OpenStreetMap). However, since the XML format puts ways after all
nodes, we also store all ways after all nodes, which means that loading
any particular way completely from the database requires hitting disk
at at least two very different locations: the location of the way node and
the interconnects between the nodes, and the location(s) of the original
location nodes. This multiple hit will occur on the node, relationship and
property stores in a similar way.

So I can also answer a question Daniel asked about the ids. The Neo4j nodes,
relationships and properties have their own id spaces, so you can have node 1,
relationship 1 and property 1.

Let's consider a real example: a street made of 5 points, added early to OSM
(so low ids in both postgres and in neo4j). The OSM file will have these
nodes near the top, but the way that connects them together will be near the
bottom of the file. In Postgres the nodes and ways are in different tables,
and both will be near the top. In neo4j both osm-ways and osm-nodes are
neo4j-nodes (in the same 'table'). The osm-nodes will have low ids, but the
way will have a high id. Also, we use proxy nodes to connect osm-ways to
osm-nodes, and these are created together with the way. So we will have
5 nodes with low ids and 8 nodes with high ids (5 proxy nodes, 1 way node,
1 geometry node and 1 tags node). If the way is big and/or edited multiple
times, we could get even higher fragmentation. Personally I think that
fragmenting one geometry into a few specific locations is not a big problem
for the neo4j caches. However, when we are talking about a result set or
traversal of thousands or hundreds of thousands of geometries, then doubling
or tripling the number of disk hits due to fragmentation can definitely have
a big impact.
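To put the street example in numbers, here is a back-of-the-envelope sketch
(the 9-byte record size and 4K block size are illustrative assumptions, not
the real store format):

```python
# The street example: 5 osm-nodes created early (low ids) plus 8
# way-related nodes created late (high ids). Record and block sizes
# are illustrative assumptions, not the real store format.

BLOCK_SIZE = 4096
RECORD_SIZE = 9

def pages(ids):
    # Distinct disk blocks touched by a set of record ids.
    return {(i * RECORD_SIZE) // BLOCK_SIZE for i in ids}

osm_nodes = list(range(10, 15))                  # the 5 points, low ids
way_related = list(range(2_000_000, 2_000_008))  # proxy/way/geometry/tags

fragmented = pages(osm_nodes + way_related)      # as imported today
defragmented = pages(range(10, 23))              # all 13 records co-located
```

Even in this tiny case the fragmented layout touches twice as many blocks as
the co-located one, and the factor compounds over large traversals.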

How can this fragmentation situation be improved? One idea is to load the
data in two passes. The current loader tries to optimize OSM import
speed, which is difficult already (and slower than an rdbms due to the
increased explicit structure), so we have a single-pass loader, with a lucene
index for reconnecting ways to nodes. However, I think we could change this
to a two-pass loader: the first pass would read and index the point
nodes into a unique id-index (for fast postgres id lookup), and the second
pass would connect the ways and store both the nodes and ways to the
database at the same time, in contiguous disk space. This would improve
query performance, and if we make a good unique-id index faster than lucene,
we would actually also improve import speed .. :-)
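The proposed two-pass load can be sketched with a toy in-memory model (the
names and structures here are illustrative, not the OSMImporter API):

```python
# Toy two-pass load: pass 1 indexes the points by their OSM id, pass 2
# writes each way together with its points into contiguous "disk" slots.
# This is an in-memory sketch, not the real importer.

def two_pass_load(osm_points, osm_ways):
    id_index = {}                      # pass 1: osm point id -> (lat, lon)
    for pid, lat, lon in osm_points:
        id_index[pid] = (lat, lon)

    store = []                         # append-only "disk": index == record id
    layout = {}                        # way id -> (first record, last record)
    for way_id, point_ids in osm_ways:
        # Pass 2: each way lands next to its own points.
        start = len(store)
        store.append(("way", way_id))
        for pid in point_ids:
            store.append(("point", pid, id_index[pid]))
        layout[way_id] = (start, len(store) - 1)
    return store, layout

store, layout = two_pass_load(
    [(1, 0.0, 0.0), (2, 1.0, 1.0)],    # points appear first, as in OSM XML
    [(100, [1, 2])],                   # ways reference points by id
)
```

Each way and its points end up in one contiguous run of records, so a
complete geometry can be read without long seeks.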

Now, all of the above does not answer the original question regarding
bounding box queries. All we will have done with this is improve the load
time for complete OSM geometries (by reducing geometry fragmentation). But
what about the index itself? We are storing the index as part of the graph.
Today, Neo4j-Spatial uses an RTree index that is created at the end of the
load in OSMImporter. This means we load the complete OSM file and then we
index it. This is a good idea because it will store the entire RTree in
contiguous disk space. Sort of: there is one issue with the RTree node
splitting that will cause slight fragmentation, but I think it is not too
serious. Now when performing bounding box queries, the main work done by the
RTree will hit the minimum amount of disk space, until 

Re: [Neo4j] Neo4j low-level data storage

2011-10-07 Thread Michael Hunger
Lots of thoughts,

I just want to add a side note on store defragmentation.

Mattias and I wrote a little tool for a customer who needed to recreate his
store with some properties/relationships omitted but using the original
node ids.

We did that by iterating through the existing store and using the batch
inserter's createNode and createRelationship methods, which take an
_explicit_ id value. So you can control which ids are assigned to which
nodes. That would allow e.g. the R-Trees to be laid out in a way that they
fit in a single persistence window and can be read in one go (that's just
the index); it would probably be interesting to co-locate the way-nodes
inside the bounding-box frame too.
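As an illustrative model of that explicit-id recreation (the real tool uses
the Java batch inserter; this Python sketch only mimics the idea):

```python
# Sketch of rebuilding a store while keeping the original node ids and
# dropping unwanted properties. Only an illustrative model of the
# batch inserter's explicit-id createNode.

class ExplicitIdStore:
    def __init__(self):
        self.nodes = {}

    def create_node(self, node_id, props):
        # The caller chooses the id, as with the batch inserter.
        if node_id in self.nodes:
            raise ValueError(f"id {node_id} already in use")
        self.nodes[node_id] = props

def recreate(old_nodes, keep_props):
    new = ExplicitIdStore()
    for node_id, props in old_nodes.items():
        kept = {k: v for k, v in props.items() if k in keep_props}
        new.create_node(node_id, kept)
    return new

old = {7: {"lat": 1.0, "tmp": "scratch"}, 9: {"lat": 2.0}}
rebuilt = recreate(old, {"lat"})
```

Because ids are preserved, any external references into the store stay valid
after the rewrite.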

Cheers

Michael

On 07.10.2011 at 22:52, Craig Taverner wrote:

 [quoted text of Craig's reply trimmed]

[Neo4j] Neo4j low-level data storage

2011-09-14 Thread danielb
Hello everybody,

I have some questions regarding data storage in neo4j. How does neo4j store
the data on the physical level? Are elements which have a close relationship
in the graph stored adjacently on disk? Is there a special binary format?
How do you treat SSDs compared to common hard disks? What are your measures
to improve I/O? Is there technical documentation which describes this?

regards,
Daniel

--
View this message in context: 
http://neo4j-community-discussions.438527.n3.nabble.com/Neo4j-low-level-data-storage-tp3336483p3336483.html
Sent from the Neo4j Community Discussions mailing list archive at Nabble.com.


Re: [Neo4j] Neo4j low-level data storage

2011-09-14 Thread Chris Gioran
Hi,

<shameless plug explanation="my blog">
You can find a decent explanation for pre-1.4 storage layout here:

http://digitalstain.blogspot.com/2010/10/neo4j-internals-file-storage.html

It has not been updated, however, to explain extended addressing for 4B+
entities in a db (which does not change the layout, just the semantics
of some bits). Nevertheless, the basic structures (fixed-size records,
doubly linked lists etc.) are still there.
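A toy model of those doubly linked relationship records (the field layout
here is purely illustrative, not the actual record format):

```python
# Relationship records chained as a doubly linked list, keyed by id.
# Illustrative field layout: (first_node, second_node, prev_rel_id, next_rel_id)

rels = {
    0: (1, 2, None, 3),
    3: (1, 4, 0, 7),
    7: (1, 5, 3, None),
}

def relationships_of(first_rel_id):
    # Follow next_rel_id pointers, as a traversal over fixed-size
    # records would, with no per-step index lookup.
    out, rid = [], first_rel_id
    while rid is not None:
        out.append(rid)
        rid = rels[rid][3]
    return out
```

Fixed-size records plus chain pointers are why a node's relationships can be
walked with simple offset arithmetic rather than an index.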

For the way I/O is performed (and minimized/improved upon) you can
look at the way memory mapping is used here:

http://digitalstain.blogspot.com/2010/10/neo4j-internals-persistence-and-memory.html

And of course, there is object caching explained here:

http://digitalstain.blogspot.com/2010/10/neo4j-internals-caching.html

</shameless plug>

The organization in general benefits greatly from SSDs, which are much
friendlier to on-disk linked lists. However, there is no optimization in
place yet that targets a specific storage technology; as in most
applications, the better the I/O performance, the better Neo4j does.

I'd be happy to help dive into more specific questions as they come to you.

cheers,
CG

On Wed, Sep 14, 2011 at 8:09 PM, danielb danielbercht...@gmail.com wrote:
 [quoted text of Daniel's original message trimmed]



Re: [Neo4j] Neo4j low-level data storage

2011-09-14 Thread Marko Rodriguez
 <shameless plug explanation="my blog">

There is no shame in producing good work and sharing it with others.

Marko.