The cache I am referring to is in fact the Lucene cache, and, as the 
details I provided state, I do need to read about 1/3 of the nodes. I have 
read the documentation, and I am flushing the cache before I read. My 
question, then, is how to apportion available memory between the Lucene 
cache and MMIO. I haven't seen any commentary on this, and it seems worth 
talking about.

I have tried the hashing/Trove-map approach with poor results, but I am 
willing to try it again. The problem with maps, of course, is that you have 
to keep every node you will need to recall in memory, or they are useless. 
But I would also be interested to hear a discussion of the trade-offs 
between reading from the Lucene cache and using maps/hashes. 
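To put a number on why the map approach is so memory-hungry, here is a 
back-of-envelope sketch. The 16 bytes per entry (long key + long value) and 
the 0.5 load factor are my assumptions about a Trove-style open-addressed 
primitive map; object-header overhead of a boxed java.util.HashMap would be 
several times worse.

```java
public class MapMemoryEstimate {
    public static void main(String[] args) {
        long entries = 700_000_000L;  // the ~1/3 of my nodes that must be read back
        int bytesPerEntry = 8 + 8;    // long key + long value, no boxing (Trove-style)
        double loadFactor = 0.5;      // assumed: half the table slots stay empty
        double bytes = entries * bytesPerEntry / loadFactor;
        // prints "~20.9 GB just for the id map"
        System.out.printf("~%.1f GB just for the id map%n",
                bytes / (1024.0 * 1024 * 1024));
    }
}
```

So even in the best (primitive-map) case, the map competes with MMIO and 
the index cache for a sizable slice of my 96 GB.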

As for RAM and N, those are flexible, you might say. I am currently 
prototyping a system that could have a much larger (as yet undetermined) N, 
and which may therefore require much more RAM than the current system has. 
So you see, I am seeking a general formula in which RAM and N are 
variables, not only to tune for performance but also to estimate future 
resource requirements. That is why I would rather deal in abstractions. I 
also think it would be more beneficial to the group at large, since other 
members have different values of N and RAM.
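To make the shape of the formula I am after concrete, here is a strawman 
parameterization. The record sizes (14 bytes per node record, 33 bytes per 
relationship record) are my reading of the 2.0 store format and should be 
treated as assumptions, and the "cap MMIO at half of RAM" split is exactly 
the placeholder I am hoping the list can replace with something principled:

```java
public class MemorySplit {
    static final long NODE_RECORD = 14;  // assumed bytes per node record
    static final long REL_RECORD  = 33;  // assumed bytes per relationship record

    /** Returns {mmioBytes, indexCacheBytes} for N nodes, 2N relationships. */
    static long[] split(long n, long ramBytes) {
        long storeBytes = n * NODE_RECORD + 2 * n * REL_RECORD; // fully mapped store
        long mmio = Math.min(storeBytes, ramBytes / 2);  // placeholder heuristic
        long indexCache = ramBytes - mmio;               // remainder to Lucene cache
        return new long[] { mmio, indexCache };
    }

    public static void main(String[] args) {
        long n = 2_200_000_000L;              // my current node count
        long ram = 96L * 1024 * 1024 * 1024;  // 96 GB
        long[] s = split(n, ram);
        // prints "MMIO: 48 GB, index cache: 48 GB"
        System.out.printf("MMIO: %d GB, index cache: %d GB%n",
                s[0] >> 30, s[1] >> 30);
    }
}
```

A better split() function, in terms of N and RAM, is what I am asking for.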

However, if you insist, I can give some particulars:

Here is my general approach: All nodes that will need to be read later 
(about a third of the total) are written first, and the index is flushed. 
Then the other 2/3 of the nodes, which are only written, and all 
relationships are written. During this phase, the previously written nodes 
are read back in order to construct the relationships.
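In code, the two phases look roughly like the following. To keep the 
sketch self-contained I use a plain HashMap in place of the Lucene batch 
index; in my real code the put/flush/get steps are the batch index's add(), 
flush(), and get(...).getSingle(), and node/relationship writes go through 
the batch inserter. All names here are illustrative, not my actual code.

```java
import java.util.HashMap;
import java.util.Map;

public class TwoPhaseLoad {
    // Stand-in for the Lucene batch index: business key -> node id.
    static final Map<String, Long> index = new HashMap<>();
    static long nextId = 0;

    // Stand-in for creating a node in the store; returns its id.
    static long createNode(String key) {
        return nextId++;
    }

    public static void main(String[] args) {
        // Phase 1: write the ~1/3 of nodes that must be read back, and index them.
        for (String key : new String[] { "a", "b", "c" }) {
            long id = createNode(key);
            index.put(key, id);  // real code: batch index add(id, key properties)
        }
        // Real code: flush the index here so phase-2 lookups can see phase-1 entries.

        // Phase 2: write the write-only nodes and the relationships,
        // looking up phase-1 node ids by key as we go.
        long writeOnly = createNode("w1");
        long target = index.get("b");  // real code: index lookup, getSingle()
        // prints "relationship: 3 -> 1"
        System.out.println("relationship: " + writeOnly + " -> " + target);
    }
}
```

The question is how to size the memory behind that lookup in phase 2.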

I have 96 GB of RAM on my prototyping system. I estimate that the data 
sample I am working with will have: 
0.7 billion read-write nodes
1.5 billion write-only nodes
4.0 billion write-only relationships
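For a rough sense of scale, those figures imply store sizes like the 
following (again, the 14- and 33-byte record sizes are my assumption about 
the store format):

```java
public class StoreSizeEstimate {
    public static void main(String[] args) {
        long nodes = 700_000_000L + 1_500_000_000L;  // read-write + write-only
        long rels  = 4_000_000_000L;
        long nodeStore = nodes * 14;  // assumed bytes per node record
        long relStore  = rels * 33;   // assumed bytes per relationship record
        double gb = 1024.0 * 1024 * 1024;
        // prints "node store: ~29 GB, relationship store: ~123 GB, total: ~152 GB"
        System.out.printf(
            "node store: ~%.0f GB, relationship store: ~%.0f GB, total: ~%.0f GB%n",
            nodeStore / gb, relStore / gb, (nodeStore + relStore) / gb);
    }
}
```

If those assumptions are even close, the stores alone exceed my 96 GB, 
which is why the MMIO-versus-index-cache split matters so much to me.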

Please bear in mind that N may grow at an unpredictable rate, and RAM will 
grow as a function of my estimates, which may be based at least in part on 
this discussion.  

On Monday, February 3, 2014 12:32:05 PM UTC-8, [email protected] wrote:
>
> In a batch insertion using indexes, given a huge set of nodes and 
> relations such that the node and relationship store cannot fit in mapped 
> memory, how should one divide memory between MMIO and index caches to 
> achieve optimal performance? I am already somewhat familiar with how to 
> divide memory within the mapped-memory schema. I am mainly interested in 
> the overall allotment of memory between MMIO and the caches. I think a 
> general answer to this would be useful to the community at large, but here 
> are some specifics about my case:
>
> - N nodes
> - 2*N relationships
> - Only about 30% of nodes are cached because the rest are never read.
> - Relationships are not cached because they are never read.
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.