I need to represent the hyperlinks between a large number of HTML
files as a graph.  My undirected graph will have about 63,000 nodes
and probably close to 500,000 edges.

I have looked into igraph
(http://cneurocvs.rmki.kfki.hu/igraph/doc/python/index.html) and
NetworkX (https://networkx.lanl.gov/wiki) for building the graph and
writing it to a file, and I have also looked into Graphviz for
visualization.  I'm just not sure which modules are best.  I need to
be able to do the following:

1)  The names of my nodes are not known ahead of time, so I will
extract the title from all the HTML files to name the nodes prior to
parsing the files for hyperlinks (edges).

2) Every file will be parsed for links, and an undirected edge will
be added between the linking page and each page it links to.

3)  Two files might link to each other, so the graph package needs to
be able to check whether an edge between two nodes already exists, or
at least not add a duplicate edge between the two nodes when adding
edges.
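
To make the three steps concrete, here is a rough sketch in plain
Python, with no graph package involved.  It assumes the titles and
links have already been extracted; the sample titles, the
links_by_title mapping and the add_link helper are all made up for
illustration.  Each edge is stored as a sorted pair, so A-B and B-A
count as the same edge.

```python
def add_link(edges, title_a, title_b):
    """Record an undirected link; return True only if it was new."""
    if title_a == title_b:
        return False  # assumption: a page linking to itself is ignored
    # Store the pair in sorted order so (A, B) and (B, A) are one edge.
    edge = (title_a, title_b) if title_a <= title_b else (title_b, title_a)
    if edge in edges:
        return False  # edge already exists, do not add it twice
    edges.add(edge)
    return True


# Hypothetical sample data: page title -> titles of pages it links to.
links_by_title = {
    "Home": ["About", "Contact"],
    "About": ["Home"],
    "Contact": ["Home", "About"],
}

nodes = set(links_by_title)                     # step 1: name the nodes
edges = set()
for source, targets in links_by_title.items():  # step 2: parse each file
    for target in targets:
        add_link(edges, source, target)         # step 3: skip duplicates
```

Because edges is a set, the "does this edge already exist" check is a
hash lookup and does not slow down as the graph grows.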

I'm relatively new to graph theory so I would greatly appreciate any
suggestions for filetypes.  I imagine doing this as a Python
dictionary with a list for the edges and a node:list pairing is out of
the question for such a large graph?
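
For reference, the dictionary layout I have in mind looks roughly like
the sketch below.  It maps each node to a set of neighbours rather
than a list (my substitution, so that the membership test is a hash
lookup), and the add_edge and has_edge helpers are hypothetical names,
not taken from any package.

```python
from collections import defaultdict


def add_edge(graph, a, b):
    """Add an undirected edge; sets make repeated adds harmless."""
    graph[a].add(b)
    graph[b].add(a)


def has_edge(graph, a, b):
    """Check for an edge without creating an entry for unknown nodes."""
    return b in graph.get(a, ())


graph = defaultdict(set)
add_edge(graph, "Home", "About")
add_edge(graph, "About", "Home")    # reverse link, no duplicate created
add_edge(graph, "Home", "Contact")

# Each undirected edge appears in two neighbour sets, so halve the sum.
edge_count = sum(len(neighbours) for neighbours in graph.values()) // 2
```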
--
http://mail.python.org/mailman/listinfo/python-list
