The cache I am referring to is in fact the Lucene cache, and, as the details I provided state, I do need to read about 1/3 of the nodes. I have read the documentation, and I am flushing the cache before I read. My question, then, is how to apportion available memory between the Lucene cache and MMIO. I haven't seen any commentary on this, and it seems worth discussing.
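To make concrete the kind of formula I am after, here is a back-of-envelope sketch. The record sizes (~15 bytes per node, ~34 bytes per relationship, roughly the 2.0 store format) and the ~64 bytes per cached index entry are my own assumptions, not authoritative numbers:

```java
// Rough memory-budget sketch for a batch insert that cannot fully map its stores.
// All per-record and per-entry sizes below are estimates, not measured values.
public class MemoryBudget {
    static final double NODE_BYTES = 15;         // assumed node record size
    static final double REL_BYTES = 34;          // assumed relationship record size
    static final double INDEX_ENTRY_BYTES = 64;  // guessed cost per cached index entry

    /** GB needed to map the node and relationship stores completely. */
    static double storeGB(double nodes, double rels) {
        return (nodes * NODE_BYTES + rels * REL_BYTES) / 1e9;
    }

    /** GB for an index cache holding one entry per node that will be read back. */
    static double indexCacheGB(double readNodes) {
        return readNodes * INDEX_ENTRY_BYTES / 1e9;
    }

    public static void main(String[] args) {
        double readNodes = 0.7e9;   // read-write nodes
        double writeNodes = 1.5e9;  // write-only nodes
        double rels = 4.0e9;        // write-only relationships
        double ramGB = 96;

        double stores = storeGB(readNodes + writeNodes, rels);
        double cache = indexCacheGB(readNodes);
        System.out.printf("stores need ~%.0f GB to map fully; index cache ~%.0f GB; RAM %.0f GB%n",
                stores, cache, ramGB);

        // Since the stores alone exceed RAM, the practical split is:
        // reserve enough cache for the read-back set, map stores with the remainder.
        System.out.printf("remaining MMIO budget: ~%.0f GB%n", ramGB - cache);
    }
}
```

With my numbers this says the stores alone need on the order of 169 GB, so full mapping is off the table and the real question is how much of the 96 GB the cache for the 0.7 billion read-back nodes deserves. That is the trade-off I would like a principled answer to.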
I have tried the hashing/Trove-map approach with poor results, but I am willing to try it again. The problem with maps, of course, is that you have to keep every node you will need to recall in memory, or they are useless. I would also be interested to hear a discussion of the trade-offs between reading from the Lucene cache and using maps/hashes.

As for RAM and N, those are flexible, you might say. I am currently prototyping a system that could have a much larger (undetermined) N, and which may therefore require much more RAM than the current system has. So you see, I am seeking a general formula in which RAM and N are variables, in order not only to tune for performance but also to estimate future resource requirements. That is why I would rather deal in abstractions. I also think it would be more beneficial to the group at large, since other members have different N and RAM.

However, if you insist, I can give some particulars. Here is my general approach: all nodes that will need to be read later (about a third of the total) are written first and flushed. Then the other 2/3 of the nodes, which will only be written, and all relationships are written; during this phase, the previously written nodes are read back in order to construct the relationships. I have 96 GB of RAM on my prototyping system, and I estimate that the data sample I am working with will have:

- 0.7 billion read-write nodes
- 1.5 billion write-only nodes
- 4.0 billion write-only relationships

Please bear in mind that N may grow at an unpredictable rate, and RAM will grow as a function of my estimates, which may be based at least in part on this discussion.

On Monday, February 3, 2014 12:32:05 PM UTC-8, [email protected] wrote:
>
> In a batch insertion using indexes, given a huge set of nodes and
> relations such that the node and relationship store cannot fit in mapped
> memory, how should one divide memory between MMIO and index caches to
> achieve optimal performance?
> I am already somewhat familiar with how to
> divide memory within the mapped-memory schema. I am mainly interested in
> the overall allotment of memory between MMIO and the caches. I think a
> general answer to this would be useful to the community at large, but here
> are some specifics about my case:
>
> - N nodes
> - 2*N relationships
> - Only about 30% of nodes are cached because the rest are never read.
> - Relationships are not cached because they are never read.
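The map alternative I mentioned can be sized the same way. A sketch, assuming a Trove-style open-addressed long-to-long map backed by two parallel long[] arrays (so roughly 16 bytes per slot) at a 0.5 load factor; the layout details are my assumptions about that implementation, not documented figures:

```java
// Rough heap cost of keeping a primitive long->long id map
// (index key -> node id) for every node that will be read back.
public class MapBudget {
    /** Approximate heap GB for an open-addressed long->long map. */
    static double mapGB(double entries, double loadFactor) {
        // Two parallel long[] arrays (keys and values), each sized
        // entries / loadFactor slots, at 8 bytes per long.
        return (entries / loadFactor) * 16 / 1e9;
    }

    public static void main(String[] args) {
        System.out.printf("0.7B entries at load factor 0.5: ~%.1f GB of heap%n",
                mapGB(0.7e9, 0.5));
    }
}
```

By this estimate the map for the 0.7 billion read-back nodes would fit comfortably in 96 GB of RAM, which is why I am willing to retry it; the open question is under what conditions it actually beats the Lucene cache.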
