Hello Giraphers,

I have a few comments about the current design of Giraph regarding the implicit creation of vertices. As it's currently designed, if you send a message to a non-existent vertex, Giraph creates it for you. Although I can see it comes in handy, since it allows for lazy dataset creation, I think it comes at a cost, and I believe this cost is bigger than the advantage:

1) It overlaps with the mutation API, where a vertex can be created explicitly when the semantics of the algorithm require it, with knowledge of what's going on and with explicit state. This makes for an ambiguous and unclear part of the API that I find difficult to justify and that is probably confusing for the user too. Which brings me to the second point.

2) It requires a different, and partially duplicated, code path for mutations and implicit vertex creation in our code, as is clear from looking at BasicRPCCommunication and as I have experienced myself (see the email I recently sent to the list). Which brings me to the third point.

3) In order to manage this, for every message we have to hit, sooner or later, the worker's vertex set to see whether the vertex exists and whether it should be implicitly created. This is computationally expensive whether you have a HashMap or a TreeMap for range partitioning. Also, if we're going to create more exotic partitioning schemes (topology-partitioning?), we're going to hit the problem even more.

In general, I don't know of any graph API that doesn't require you either to list the vertex set explicitly at load time or to create vertices explicitly through the API.

As I said, I understand it allows for lazy creation of the input file, where some vertices may not be explicitly listed (missing as a source vertex but present as the endpoint of an edge), but this could be fixed robustly by a single MapReduce job.

What do you guys think?

--
Claudio Martella
claudio.marte...@gmail.com
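
P.S. To make the preprocessing idea concrete, here is a minimal sketch of the logic such a job would perform, written as plain in-memory Java rather than against the Hadoop API. The class name MissingVertexFiller, the method fill, and the whitespace-separated adjacency-list format ("source target1 target2 ...") are all assumptions made for illustration; they are not part of Giraph. The sketch collects every vertex that appears as an edge endpoint but never as a source, and appends an edge-less line for each.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Giraph: makes implicit vertices explicit.
public class MissingVertexFiller {

    /**
     * Takes adjacency-list lines of the (assumed) form "source target1 target2 ..."
     * and returns the same lines plus one edge-less line for every vertex that
     * appears only as a target.
     */
    public static List<String> fill(List<String> adjacencyLines) {
        Set<String> sources = new LinkedHashSet<>();
        Set<String> targets = new LinkedHashSet<>();
        List<String> out = new ArrayList<>();

        for (String line : adjacencyLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] tokens = trimmed.split("\\s+");
            sources.add(tokens[0]);
            for (int i = 1; i < tokens.length; i++) {
                targets.add(tokens[i]);
            }
            out.add(trimmed);
        }

        // Vertices that are edge endpoints but never listed as a source.
        targets.removeAll(sources);
        out.addAll(targets);
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a b c", "b c");
        for (String line : fill(input)) {
            System.out.println(line);
        }
    }
}
```

In a real MapReduce job the same set difference would be computed by grouping on vertex id, but the effect on the input file is the same: after this pass every vertex is listed explicitly, so the loader never needs to create one on message delivery.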