Hi, some time ago I started working on visualizing graph data stored in Hadoop with Gephi. A first draft of the results is in this blog post: http://blog.cloudera.com/blog/2014/05/how-to-manage-time-dependent-multilayer-networks-in-apache-hadoop/ We found that handling the metadata for graphs and the appropriate input converters was the main problem to solve. Now it is easy to retrieve edge and node lists, even for time-dependent graphs. The current solution retrieves the data from Hive or Impala via JDBC.
But I think it would be great to have an API in Giraph that allows triggering a snapshot of the current state of a graph while it is being processed. After such a snapshot is written, an external tool loads this data, e.g. into Gephi. Maybe in a second step we could load the data directly from all worker nodes instead of going through HDFS, but to begin with it would be fine to use HDFS to decouple the processing layer from the GUI. For really large graphs, I think a Java applet using the "gephi-tools" project could do a great job of rendering the graph.

The snapshot could be triggered via ZooKeeper: a job registers its ability to receive such an optional request, a client then discovers via ZooKeeper all graphs it can look into (based on such a snapshot) and sends the request. At the start of the next superstep the job checks the snapshot status in ZooKeeper and either creates a snapshot or simply proceeds, and so on. This would even allow exporting time-dependent intermediate results from a running graph algorithm without restarting it.

What do you think about such a feature? I think it is also related to the "graph centric API" proposed a while ago. Is it worth a JIRA, and do you see use cases for this feature?

Best wishes,
Mirko
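PS: To make the per-superstep check concrete, here is a rough sketch of the protocol. All names are hypothetical (Giraph has no such API today), and a plain in-memory coordinator stands in for the ZooKeeper znode that a client would set; a real implementation would watch that znode from the master instead.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Stand-in for the ZooKeeper znode a client would write to request a
 * snapshot. In a real implementation the master would register a watch
 * on the znode rather than poll a flag.
 */
class SnapshotCoordinator {
    private volatile boolean requested = false;

    /** Client side: ask the running job for a snapshot. */
    void requestSnapshot() { requested = true; }

    /** Master side, once per superstep: consume a pending request. */
    boolean pollAndClear() {
        boolean r = requested;
        requested = false;
        return r;
    }
}

class SnapshotDemo {
    /**
     * Simulated superstep loop: runs a fixed number of supersteps,
     * injects a client request at superstep {@code requestAt}, and
     * records at which supersteps a snapshot would have been written.
     */
    static List<Integer> run(SnapshotCoordinator zk, int supersteps, int requestAt) {
        List<Integer> snapshotSupersteps = new ArrayList<>();
        for (int s = 0; s < supersteps; s++) {
            if (s == requestAt) {
                zk.requestSnapshot(); // an external client triggers the snapshot
            }
            if (zk.pollAndClear()) {
                // Here the job would dump the current edge and node lists
                // to HDFS, where Gephi (or another tool) picks them up.
                snapshotSupersteps.add(s);
            }
            // ... normal compute() work for this superstep ...
        }
        return snapshotSupersteps;
    }
}
```

The point of the sketch is only the control flow: the request is decoupled from the compute loop, so the job pays nothing when no snapshot is pending and writes one at the next superstep boundary when it is.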
