Thanks, Tamas. I'll add a little more color to what I'm doing. I'm actually parsing the output format of https://pypi.python.org/pypi/meliae (which ends up as a list of json objects that have attributes about python objects in memory, and references to other objects). So, I can't use any of the built in igraph loaders. Instead, I'm using add_vertex for each object as I'm reading it, and the queueing up batches of edges to add in bulk periodically.
The advantage I see of memory mapping is that if we can get to the point where a particular igraph_t (and all of the data it references) is backed by a memory mapped file, then loading it doesn't involve parsing at all. Instead, it's just a matter of mapping the file contents into memory again, and wiring up the pointers to point to the right file offsets. I haven't understood enough of the igraph code yet to know whether reallocs of the vertex and edge lists are well contained, but it should be the case that the knowledge of memory-mapped or not can be contained to a relatively small location. Also, I don't think it's the case that "you have to load the entire graph into memory at some point". If using a memory mapped file, I think you'd be able to rely on the os to only have the subset of pages resident that you actually need to run the particular graph operations. Now, it may be that some operations will force you to walk through the entire graph, which would introduce some memory pressure as pages are paged in and out, but I don't think that would be worse that the current strategy of having all of the nodes dynamically allocated. -Cale On Tue Feb 24 2015 at 10:27:59 AM Tamas Nepusz <[email protected]> wrote: > > I would like to store that graph in a persistent format, and it seems > like > > igraph using a memory-mapped file (mmap) would do the trick. Has anyone > > attempted that in the past? > I don't think so, and actually, I'm not convinced about the advantages of > this > approach -- the thing is that igraph would have to load the entire graph > into > memory at some point anyway, so you cannot avoid reading through the whole > file. > > I assume that you are using the edge list format (which is the fastest to > load) > and I did some profiling on my machine with a graph the size of yours (but > generated with igraph_erdos_renyi_game). On my machine, it took ~25 > seconds to > load the entire graph, and about 38.7% of that time was spent in > igraph_create() which converts an edge list (presented as a list of > numbers) > into an igraph_t object. This is the part that cannot be avoided, no matter > what format you use to store your graph on the disk. > > The part where we can probably save some time is the calls to fscanf(), > getc(), > ungetc() and feof(), which accounted for 31.9% + 12.3% + 6.6% + 5.8% of > the time, > respectively. These calls are necessary due to how the nodes are stored in > the > file: they are written as strings, which have to be parsed. You could > probably > save a lot of time by using a binary graph format and then creating an > appropriate reader and writer for that. I am thinking about something like > this: > > first, use 4 bytes to store the number of vertices and maybe one byte to > store > whether the graph is directed or undirected > then, for each vertex, > store the number of neighbors in 4 bytes > and then store the indices of the actual neighbors on 4 bytes each > > This has the advantage that there is no need to call fscanf(), getc(), > ungetc() > and feof() -- you can simply read chunks of four bytes as needed and there > is > no need to check for EOF repeatedly because you "know" in advance how many > vertices there will be. It also uses a more compact representation because > n edges originating from the same vertex use only (n+1)*4 bytes, so there > will > be less data to read from the disk. Also, due to the way how the edge list > is > stored in this case, the edge list will be sorted by source vertices by > default, and this might also save some more time in the call to > igraph_create() > later on. > > Actually, igraph already contains a reader which is very similar to this; > it is > called igraph_read_graph_graphdb(). The only difference is that it handles > undirected graphs only and it stores vertex indices on two bytes so it > scales up to 65535 vertices only. It is probably easy to adapt the source > code > of igraph_read_graph_graphdb() to make it use four bytes per vertex and > then > you are essentially ready. (And maybe it would be a nice addition to the > core > library as well, although it's probably Gabor's call in the end and not > mine). > > All the best, > -- > T. > > _______________________________________________ > igraph-help mailing list > [email protected] > https://lists.nongnu.org/mailman/listinfo/igraph-help >
_______________________________________________ igraph-help mailing list [email protected] https://lists.nongnu.org/mailman/listinfo/igraph-help
