On Jun 24, 5:58 am, "A.T.Hofkamp" <[EMAIL PROTECTED]> wrote: > On 2008-06-24, MRAB <[EMAIL PROTECTED]> wrote: > > > On Jun 24, 1:26 am, [EMAIL PROTECTED] wrote: > >> I need to represent the hyperlinks between a large number of HTML > >> files as a graph. My non-directed graph will have about 63,000 nodes > >> and and probably close to 500,000 edges. > > >> I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/ > >> python/index.html) and networkX (https://networkx.lanl.gov/wiki) for > >> generating a file to store the graph, and I have also looked into > >> Graphviz for visualization. I'm just not sure which modules are > >> best. I need to be able to do the following: > > Afaik Graphviz is not good at abstracting the graph, which you may need here. > A page with 63,000 circles on it, and 500,000 edges will probably come out of > the printer as a black sheet of paper. > (8"x11" paper, 1 circle is 1/5", then you have only 2200 circles at one sheet. > You need a factor 28 more circles which leads to a circle of about 0.007".) > > Even when actual paper format would not be a problem, you will need some > abstraction and/or tools, as you will not find anything manually in an ocean > of > 63,000 elements. > > One area you may want to start looking for tools is in state graphs, where the > set of possible states of an entire system is unfolded. These things go up to > over a million states, so you only have a 'small' problem there... > > > > > > >> 1) The names of my nodes are not known ahead of time, so I will > >> extract the title from all the HTML files to name the nodes prior to > >> parsing the files for hyperlinks (edges). > > >> 2) Every file will be parsed for links and nondirectional connections > >> will be drawn between the two nodes. > > >> 3) The files might link to each other so the graph package needs to > >> be able to check to see if an edge between two nodes already exists, > >> or at least not double draw connections between the two nodes when > >> adding edges. > > >> I'm relatively new to graph theory so I would greatly appreciate any > >> suggestions for filetypes. I imagine doing this as a python > >> dictionary with a list for the edges and a node:list paring is out of > >> the question for such a large graph? > > > Perhaps a dictionary where the key is a node and the value is a set of > > destination nodes? > > For undirected edges, you could make an Edge class and have a set of Edge's > (where two Edge objects are equal when they connect the same nodes). > I don't expect 500,000 elements in a set to be a problem. > > Sincerely, > Albert- Hide quoted text - > > - Show quoted text -
Readability counts: Do you need to be able to examine the datafile by hand? If not, investigate the 'shelve' module. You can go two ways. One is to map each node to a list of nodes. It's probably more intuitive, but it needs repeated code: shelf['pageA'].add( pageB ) shelf['pageB']= pageB shelf['pageB'].add( pageA ) Which is incorrect as is anyway-- updates to shelved objects aren't committed automatically. a= shelf['pageA'] a.add( pageB ) shelf['pageA']= a b= shelf['pageB'] b.add( pageA ) shelf['pageB']= b is closer. Otherwise, two is to store a 2-tuple of node keys in a secondary file. shelfEdges[ frozenset( shelfVerts[ 'pageA' ], shelfVerts[ 'pageB' ] ) ]= True -- http://mail.python.org/mailman/listinfo/python-list