Hello all,

I thought I'd share some profiling results from the graph/network library I've been working on -- see:

https://github.com/WebDrake/Dgraph
http://braingam.es/2013/07/complex-networks-in-d/
http://braingam.es/2013/07/complex-networks-in-d-part-2-building-the-network/
I'd like to get others' thoughts and insight on these -- obviously I have some ideas of my own about how to interpret them and move forward, but it's always good to have feedback (and there are several things I'd like to make sure I have my facts straight on before making any more "public" comment).

The basic data structure, which is taken from the igraph library, consists of a pair of arrays of head and tail vertices for edges, together with indices which sort the edges according to head or tail vertex, and sums of the total number of edges whose head/tail ID is less than a certain value:

https://github.com/WebDrake/Dgraph/blob/master/dgraph/graph.d#L33-L38

Taken together this offers a very memory-efficient structure that should also be able to minimize the total number of allocations required to build a graph.

When I started working on the code, the initial goal was to have something that gave a very simple and obvious representation of the algorithms at work, which would be easy to understand. So, for example, I used std.algorithm.map quite freely to make an "easy" representation of how a vertex's neighbours were calculated:

https://github.com/WebDrake/Dgraph/blob/master/dgraph/graph.d#L260-L286
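To make the discussion concrete, here is a stripped-down sketch of the scheme -- this is not the actual code behind the links above (the real Graph is templated on directedness and has rather more machinery), and the field names are purely illustrative:

import std.algorithm : map;
import std.range : chain, iota;

// Illustrative sketch of the igraph-style storage scheme.  Each sum
// array is assumed to have vertexCount + 1 entries, so that the edges
// whose tail is v occupy positions sumTail[v] .. sumTail[v + 1] of the
// tail-sorted index (and likewise for heads).
struct SimpleGraph
{
    size_t[] head;       // head vertex of each edge
    size_t[] tail;       // tail vertex of each edge
    size_t[] indexHead;  // edge IDs sorted by head vertex
    size_t[] indexTail;  // edge IDs sorted by tail vertex
    size_t[] sumHead;    // sumHead[v] = no. of edges whose head vertex < v
    size_t[] sumTail;    // sumTail[v] = no. of edges whose tail vertex < v

    // Lazy neighbour calculation in the spirit of the map/chain version
    // linked above: edges where v is the tail contribute their heads,
    // and edges where v is the head contribute their tails.
    auto neighbours(size_t v)
    {
        return chain(iota(sumTail[v], sumTail[v + 1])
                         .map!(i => head[indexTail[i]]),
                     iota(sumHead[v], sumHead[v + 1])
                         .map!(i => tail[indexHead[i]]));
    }
}

Each call to neighbours() here constructs a couple of lazy MapResult ranges plus a chain Result on the fly, and it's this kind of per-call machinery (and the allocations that come with it) that the profile below picks up.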
It turns out this results in a fairly major speed hit that stems from a variety of factors. The following is the result of profiling a simple function that just iterates over all the neighbours of each vertex in turn, and sums the neighbour IDs. The code is attached in neighbours.d.

---------------------------------------------------------------------------------
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 19.20      0.24     0.24 50000000     0.00     0.00  _D6dgraph5graph13__T5GraphVb0Z5Graph10neighboursMxFymZS3std5range380__T5chainTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ5chainFS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ6Result18__T10__lambda46TmZ10__lambda46MFNbNfmZxm
 18.40      0.47     0.23                             _D2gc3gcx2GC12mallocNoSyncMFmkPmZPv
 15.60      0.67     0.20 50000000     0.00     0.00  _D6dgraph5graph13__T5GraphVb0Z5Graph10neighboursMxFymZS3std5range380__T5chainTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ5chainFS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ6Result18__T10__lambda48TmZ10__lambda48MFNbNfmZxm
  9.60      0.79     0.12   500000     0.00     0.00  _D10neighbours24__T15checkNeighboursVb0Z15checkNeighboursFKC6dgraph5graph13__T5GraphVb0Z5GraphZm
  8.00      0.89     0.10 25000000     0.00     0.00  _D6dgraph5graph13__T5GraphVb0Z5Graph10neighboursMxFymZS3std5range380__T5chainTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultTS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ5chainFS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda46TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultS6dgraph5graph13__T5GraphVb0Z5Graph10neighboursM123__T9MapResultS596dgraph5graph13__T5GraphVb0Z5Graph10neighboursM10__lambda48TS3std5range15__T4iotaTxmTxmZ4iotaFxmxmZ6ResultZ9MapResultZ6Result
  8.00      0.99     0.10                             _D2gc3gcx2GC6mallocMFmkPmZPv
  4.80      1.05     0.06                             _D2gc6gcbits6GCBits3setMFmZv
  4.80      1.11     0.06                             _D2gc3gcx3Gcx11fullcollectMFZm
  4.00      1.16     0.05                             _D2gc6gcbits6GCBits4testMFmZm
  2.40      1.19     0.03                             _D4core4sync5mutex5Mutex6unlockMFNeZv
  1.60      1.21     0.02                             _D2gc6gcbits6GCBits5allocMFmZv
  1.20      1.22     0.02                             _D3std4conv15__T6toImplTiTkZ6toImplFNaNfkZi15__dgliteral2501MFNaNfZC6object9Throwable
  1.20      1.24     0.02                             gc_malloc
  0.40      1.24     0.01                             _D4core4sync5mutex5Mutex12MonitorProxy11__xopEqualsFKxS4core4sync5mutex5Mutex12MonitorProxyKxS4core4sync5mutex5Mutex12MonitorProxyZb
  0.40      1.25     0.01                             _D4core4sync5mutex5Mutex4lockMFNeZv
  0.40      1.25     0.01                             gc_clrAttr
---------------------------------------------------------------------------------

What's striking about this is (i) the various range/map based functions carry a substantial hit in and of themselves; (ii) there's also a fairly nasty hit from the GC. (If I run this with a directed graph, which avoids the use of std.range.chain, there's a little bit of saving, but only a very little.)

As an experiment, I wrote some alternative code that provides a cache for all the neighbours of all the vertices. This takes an overall memory hit but means that one can avoid having to recalculate the list of neighbours so long as the graph itself doesn't change:

https://github.com/WebDrake/Dgraph/blob/cache/dgraph/graph.d#L42-L53
https://github.com/WebDrake/Dgraph/blob/cache/dgraph/graph.d#L358-L379
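Very roughly, the idea is along these lines -- an illustrative sketch only, not the code behind the links above (on the cache branch the cache actually lives inside Graph itself), and assuming a graph type that exposes vertexCount and a lazy neighbours(v) range of vertex IDs, as the undirected Graph does:

// Minimal sketch of the cacheing idea: lazy neighbour ranges are
// evaluated once into plain arrays, which are reused until the graph
// is mutated.  All names here are made up for illustration.
struct NeighbourCache
{
    size_t[][] cache;   // cache[v] holds v's neighbour IDs once computed
    bool[] valid;       // valid[v] is true once cache[v] has been filled

    this(size_t vertexCount)
    {
        cache.length = vertexCount;
        valid.length = vertexCount;
    }

    // To be called whenever the graph is mutated (addEdge, addVertices,
    // ...) so that stale neighbour lists get recomputed on next request.
    void invalidate()
    {
        valid[] = false;
    }

    size_t[] neighbours(G)(ref G g, size_t v)
    {
        if (!valid[v])
        {
            // Evaluate the lazy map/chain range once, writing the values
            // into a plain array that later calls can simply return.
            cache[v].length = 0;
            foreach (n; g.neighbours(v))
            {
                cache[v] ~= n;
            }
            valid[v] = true;
        }
        return cache[v];
    }
}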
The speed benefits are immediate:

---------------------------------------------------------------------------------
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 64.29      0.14     0.14 25000000     0.00     0.00  _D6dgraph5graph13__T5GraphVb0Z5Graph10neighboursMFNaNbNfymZAm
 33.33      0.21     0.07   500000     0.00     0.00  _D10neighbours24__T15checkNeighboursVb0Z15checkNeighboursFNaNbNfKC6dgraph5graph13__T5GraphVb0Z5GraphZm
  2.38      0.21     0.01                             _D6dgraph5graph13__T5GraphVb0Z5Graph13incidentEdgesMFNaNbNfymZAm
---------------------------------------------------------------------------------

In fact there's a double benefit here, because (i) it's more efficient to write the neighbour values into an array than to use a map, and (ii) because we're caching them, we only have to calculate the neighbours once.

The extra memory required may be bearable, given that the overall constraints on network size are probably going to be down to something other than RAM. However, I think it might be worth avoiding this if possible.

So I'd like to be sure that I understand the situation. First, are map, chain, etc. likely to become more efficient in terms of speed and memory use? I recall that there has been some concern over unnecessary allocations, and I'm not sure whether std.algorithm.map or std.range.chain are victims here.

Alternatively, there might be some benefit in replacing the maps, chains, etc. with custom data structures that provide _exactly_ what is wanted.

Finally, whether using the map approach or the cache, neither seems as fast in benchmarking as the simple ad-hoc approach I might use for a graph in a simulation, such as:

    size_t[][] neighbours;
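Fleshed out slightly, the sort of thing I mean is just this (an illustrative sketch, undirected case):

// Quick-and-dirty adjacency-list representation: nested dynamic arrays
// built up by appending.  Names are illustrative.
struct AdHocGraph
{
    size_t[][] neighbours;   // neighbours[v] = IDs of v's neighbours

    void addVertices(size_t n)
    {
        neighbours.length += n;
    }

    // Each append may reallocate and copy the per-vertex array.
    void addEdge(size_t u, size_t v)
    {
        neighbours[u] ~= v;
        neighbours[v] ~= u;
    }
}

Lookup is then just a plain array traversal, with no lazy ranges and no allocation at lookup time, which is presumably why it comes out fastest in simple benchmarks.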
My instinct is that this latter approach will start to choke when you start trying to build a really big graph, because of the many, many reallocations required to build the structure. It will probably also become unwieldy if you start wanting to allow arbitrary collections of edge and vertex properties (weight, colour, etc.), because it doesn't allow for a very elegant way to store and retrieve those properties.

Suffice to say that it looks like there is an interesting range of compromises to make depending on needs, and it's quite likely there's a benefit in providing different graph data structures depending on what scale of graph (and what kind of graph properties) you want to work with.

Anyway, although this is a very limited bit of profiling work, I'd be very grateful for anyone's thoughts or insight on the issues raised here.

Thanks & best wishes,

    -- Joe


// ... everybody needs good neighbours ...
import std.datetime, std.stdio;
import dgraph.graph, dgraph.test.samplegraph50;

size_t checkNeighbours(bool directed)(ref Graph!directed g)
{
    size_t neighbourSum = 0;

    foreach (vertex; 0 .. g.vertexCount)
    {
        static if (directed)
        {
            foreach (n; g.neighboursIn(vertex))
            {
                neighbourSum += n;
            }

            foreach (n; g.neighboursOut(vertex))
            {
                neighbourSum += n;
            }
        }
        else
        {
            foreach (n; g.neighbours(vertex))
            {
                neighbourSum += n;
            }
        }
    }

    return neighbourSum;
}

void main()
{
    auto g = new Graph!false;
    g.addVertices(50);
    // g.addVertices(10_000);
    size_t neighbourSum = 0;

    foreach (i; 0 .. sampleGraph50.length / 2)
    {
        g.addEdge(sampleGraph50[2 * i], sampleGraph50[2 * i + 1]);
    }

    /* foreach (i; 0 .. sampleGraph10k.length / 2)
    {
        g.addEdge(sampleGraph10k[2 * i], sampleGraph10k[2 * i + 1]);
    } */

    writeln("Vertex count: ", g.vertexCount);
    writeln("Edge count: ", g.edgeCount);

    StopWatch watch;
    watch.start;

    foreach (i; 0 .. 500_000)
    {
        neighbourSum += checkNeighbours(g);
    }

    watch.stop;
    writeln("Done in ", watch.peek.msecs, " ms.");
    writeln("Neighbour sum = ", neighbourSum);
}