Hi Benjamin,

Thanks for sharing your ideas.
First, let me clarify that my proposal does not aim at excluding any of the
current use cases from Giraph. I just would like to improve some common use
cases that come up very often.

Now, on the technical side of processing RDF.
The type of a vertex/edge could be represented as an enum which is an
integer number in the end.
I guess that having a central String<->ID map per worker is more memory
efficient than explicitly representing the vertex/edge attributes as
Strings, for any reasonable distribution of the vertexes I can think of
(if you have 1 vertex per machine then of course it is worse, but
that's pathological anyway).
If creating this kind of dictionary is easy (or even already done) for the
user, and there is a well defined and documented idiom to do this kind of
computation, I don't think users will have a problem with filling in some
structure in the worker rather than implementing a vertex input format.
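
To make the dictionary idea a bit more concrete, here is a minimal sketch in
Java. The class name TypeDictionary and its methods are made up for
illustration; they are not part of Giraph, and how such an object would be
attached to a worker is left open. It only shows the String<->ID mapping
itself: each distinct type label is stored once per worker, and a vertex or
edge keeps just the int.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-worker dictionary: type label <-> dense int ID. */
final class TypeDictionary {
    // Label -> ID, used while loading the graph.
    private final Map<String, Integer> idByLabel = new HashMap<>();
    // ID -> label, used when the original string is needed again.
    private final List<String> labelById = new ArrayList<>();

    /** Returns the ID of a label, assigning the next free ID on first use. */
    synchronized int encode(String label) {
        Integer id = idByLabel.get(label);
        if (id == null) {
            id = labelById.size();
            idByLabel.put(label, id);
            labelById.add(label);
        }
        return id;
    }

    /** Returns the label previously registered under the given ID. */
    synchronized String decode(int id) {
        return labelById.get(id);
    }

    /** Number of distinct labels seen so far. */
    synchronized int size() {
        return labelById.size();
    }
}
```

One caveat with this sketch: IDs are assigned in order of first use, so two
workers may give different IDs to the same label. If IDs travel in messages
between workers, the dictionary would have to be built deterministically or
shared up front.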

Custom state in vertexes is for sure something I would like to keep in
Giraph (at the expense of higher memory footprint).
On the other hand, Giraph is about big data, so it is suboptimal that I
need 10 machines in Giraph to equal the performance I get on a single
machine, just because I need to load the graph uncompressed.
I would also like to see improvements for common simple cases that would
make Giraph more useful in practice.

The right API to allow all these cases should come out of this discussion.

Cheers,
--
Gianmarco



On Mon, Aug 20, 2012 at 5:44 PM, Benjamin Heitmann <
[email protected]> wrote:

>
> Hello, just a few in-line comments regarding the simplification of vertex
> classes.
>
> In my opinion the proposed change might exclude all typed graphs, and all
> Semantic Web style processing from Giraph.
>
> On 17 Aug 2012, at 14:30, Gianmarco De Francisci Morales wrote:
>
> > In any case, if one wanted to use a compressed memory representation by
> > aggregating different edge lists together, could one use the worker
> > context as a central point of access to the compressed graphs?
> > I can imagine a vertex class that has only the ID and uses the worker
> > context to access its edge list (i.e. it is only a client to a central
> > per-machine repository).
> > Vertexes in the same partition would share this data structure.
>
> In the current vertex class signature, every user vertex can choose to
> have a complex class to hold the state of the vertex.
>
> Will that capability be gone with this proposed simplification of a vertex
> to only hold an id and a list of neighbour vertices?
>
> While most of the popular graph algorithms only take the graph itself into
> account, there are types of algorithm which also can take the semantics of
> the graph, of a node and of an edge into account. Basically everything from
> the area of Semantic Web graph analysis falls in this area, and one
> specific type of algorithm is spreading activation.
>
> In a nutshell, spreading activation is a breadth-first search which is
> guided by the semantics of the vertices and edges.
>
> An example: return all persons and posts which are somehow related to this
> one person. In addition, all vertices which are not persons or posts, and
> give twice as much weight in the ranking to properties from the music
> domain (all other properties have normal weight).
>
> If semantics can not be stored as part of a vertex or an edge, then this
> would require an external database lookup for each compute() call to a
> vertex. That would basically eliminate all reasons to use giraph for this
> kind of algorithm.
>
> > Is there any obvious technical fallacy in this scheme?
>
> Not a technical fallacy, but I would argue that a lot will be lost by not
> giving developers a mechanism for including custom state in their vertex.
> Of course, developers need to be aware that this will increase the memory
> footprint of their objects, and I guess serialising/deserialising of
> strings will be a huge issue.
>
> But that should not be a reason to completely exclude such algorithms from
> using giraph.
> Or to exclude any kind of typed graph, semantic network from using giraph.
