I need to represent the hyperlinks between a large number of HTML files as a graph. My undirected graph will have about 63,000 nodes and probably close to 500,000 edges.
I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/python/index.html) and NetworkX (https://networkx.lanl.gov/wiki) for generating a file to store the graph, and I have also looked into Graphviz for visualization. I'm just not sure which modules are best. I need to be able to do the following:

1) The names of my nodes are not known ahead of time, so I will extract the title from each HTML file to name the nodes before parsing the files for hyperlinks (edges).
2) Every file will be parsed for links, and an undirected edge will be drawn between the linking file's node and the linked file's node.
3) The files might link to each other, so the graph package needs to be able to check whether an edge between two nodes already exists, or at least not draw the connection twice when adding edges.

I'm relatively new to graph theory, so I would greatly appreciate any suggestions for file types. I imagine that doing this as a Python dictionary, with each node paired with a list of its edges (a node:list pairing), is out of the question for such a large graph?
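
To make the dictionary idea concrete, here is a minimal sketch of what I have in mind, using a set per node instead of a list so that repeated links collapse automatically. The function names and the sample page titles are made up for illustration, and I have not tried this at the full 63,000-node scale.

```python
from collections import defaultdict


def build_graph(links):
    """Build an undirected adjacency mapping from (source, target) title pairs.

    `links` is a hypothetical iterable of pairs produced by the HTML parsing
    step; each node maps to the set of nodes it is connected to.
    """
    graph = defaultdict(set)
    for source, target in links:
        if source == target:
            continue  # ignore a page linking to itself
        # Store the edge in both directions; sets drop duplicates silently.
        graph[source].add(target)
        graph[target].add(source)
    return graph


def has_edge(graph, a, b):
    """Return True if an edge between a and b is already stored."""
    return b in graph.get(a, ())


def edge_count(graph):
    """Each undirected edge is stored twice, once per endpoint."""
    return sum(len(neighbors) for neighbors in graph.values()) // 2


# Made-up sample data: "Home" and "About" link to each other, and one link repeats.
links = [
    ("Home", "About"),
    ("About", "Home"),
    ("Home", "Contact"),
    ("Home", "About"),
]
graph = build_graph(links)
print(edge_count(graph))
print(has_edge(graph, "About", "Home"))
```

With sets, requirement 3 comes for free: adding an edge that already exists does nothing, and checking for an existing edge is a single membership test.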