Hi,

The reason why I believe using the hash code is bad for performance is that the locality of nodes in the backend storage (MongoDB, or any other kind of database) is worse than it is now. This affects performance when writing, and when reading siblings.

Let's assume we have the nodes /page_1/paragraph_0..2/child_0..2/ (that is, 3 paragraph nodes, and each paragraph node has 3 child nodes). With the current approach, the entries are stored as follows:
1:/page_1
...
2:/page_1/paragraph_0
2:/page_1/paragraph_1
2:/page_1/paragraph_2
...
3:/page_1/paragraph_0/child_0
3:/page_1/paragraph_0/child_1
3:/page_1/paragraph_0/child_2
3:/page_1/paragraph_1/child_0
3:/page_1/paragraph_1/child_1
3:/page_1/paragraph_1/child_2
3:/page_1/paragraph_2/child_0
3:/page_1/paragraph_2/child_1
3:/page_1/paragraph_2/child_2

The child documents (child_0 .. child_2) of all paragraphs are stored next to each other, likely in the same b-tree page, which means likely in the same sector on the hard drive, even if there are many other pages (page_2, page_3, ...). That's good for caching.

With a hash-based approach, data is distributed over the whole keyspace. Even though child_0 .. child_2 of each paragraph are still next to each other, locality is much worse. If we assume 3 child nodes per node on average, we almost have the same situation as with Jackrabbit 2.x, where node id lookup was by far the biggest performance problem, because node ids are randomly distributed there. Using randomly distributed primary keys is one of the worst performance problems when using databases. That's why, for example, the default object id of MongoDB is sequential (time based): http://docs.mongodb.org/manual/reference/object-id/

I'm not worried at all about the overhead of calculating the hash code of the path; that is insignificant. It's the random distribution of the data that is problematic.

See also:

http://kccoder.com/mysql/uuid-vs-int-insert-performance/
(especially the first two charts comparing auto_increment versus uuid)

http://www.mysqlperformanceblog.com/2007/03/13/to-uuid-or-not-to-uuid/
"What is the biggest problem with UUID? It is the fact inserts are going in random locations in index tree which will require a lot of IO in case of index tree not fitting into memory"

http://stackoverflow.com/questions/2365132/uuid-performance-in-mysql
"At my job, we use UUID as PKs.
What I can tell you from experience is DO NOT USE THEM as PKs (SQL Server by the way). It's one of those things that when you have less than 1000 records it's ok, but when you have millions, it's the worst thing you can do. Why? Because UUID are not sequential, so everytime a new record is inserted MSSQL needs to go look at the correct page to insert the record in, and then insert the record. The really ugly consequence with this is that the pages end up all in different sizes and they end up fragmented, so now we have to do de-fragmentation periodic..."

https://issues.apache.org/jira/browse/JCR-2857

Regards,
Thomas
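
P.S. To make the locality point concrete, here is a minimal Python sketch. The key formats are simplified assumptions (SHA-1 stands in for whatever hash function would actually be used, and `path_key` / `hash_key` are hypothetical names). Sorting by the depth-prefixed path key keeps all grandchildren of page_1 contiguous, while sorting by the hashed key spreads them over the whole keyspace:

```python
import hashlib

def path_key(path):
    # current approach (simplified): depth-prefixed path, e.g. "3:/page_1/paragraph_0/child_0"
    return str(path.count("/")) + ":" + path

def hash_key(path):
    # hash-based approach (SHA-1 as a stand-in hash)
    return hashlib.sha1(path.encode("utf-8")).hexdigest()

paths = ["/page_%d/paragraph_%d/child_%d" % (page, p, c)
         for page in (1, 2, 3) for p in range(3) for c in range(3)]

by_path = sorted(paths, key=path_key)
by_hash = sorted(paths, key=hash_key)

# Depth-prefixed keys: the 9 grandchildren of page_1 sort first,
# so they likely end up in the same b-tree page.
print(by_path[:9])
# Hashed keys: nodes of different pages are interleaved.
print(by_hash[:9])
```

With path keys the first 9 entries in sort order all belong to /page_1; with hashed keys the same nodes are (almost certainly) interleaved with nodes of the other pages.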
