Hi Dan,
I don't have an answer to your questions/observations yet.

However, I suspect N-Triples | N-Quads might not be the best option for
something like Giraph. Something more like an adjacency list might be
better.

So my intuition is that, if you start with RDF in N-Triples format,
the first step would be a simple MapReduce job to group RDF statements
by subject (optionally filtering out certain properties):

Input:

  s1 --p1--> o1
  s1 --p2--> o2
  s1 --p2--> o3
  s2 ...

Output (adjacency list):

  s1 (p1 o1) (p2 o2) (p2 o3)
  s2 ...
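
To make the grouping step concrete, here is a rough in-memory sketch in
Python. It is only a stand-in for the MapReduce job (the "map" part emits
the subject as the key, the "reduce" part collects the (predicate, object)
pairs), and the function names are made up for illustration. The parser is
deliberately naive: it assumes one well-formed statement per line and is
not a full N-Triples parser.

```python
from collections import defaultdict


def parse_ntriple(line):
    """Split one N-Triples line into (subject, predicate, object).

    Returns None for blank lines and comment lines.
    Naive: assumes one well-formed statement per line.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Subject and predicate never contain whitespace; the object may
    # (e.g. a literal such as "hello world"), so split at most twice.
    subject, predicate, rest = line.split(None, 2)
    if not rest.endswith("."):
        raise ValueError("statement does not end with '.': %r" % line)
    obj = rest[:-1].rstrip()  # drop the terminating '.'
    return subject, predicate, obj


def group_by_subject(lines, skip_predicates=frozenset()):
    """Group statements by subject, optionally dropping some predicates."""
    adjacency = defaultdict(list)
    for line in lines:
        triple = parse_ntriple(line)
        if triple is None:
            continue
        subject, predicate, obj = triple
        if predicate in skip_predicates:
            continue
        adjacency[subject].append((predicate, obj))
    return adjacency


def format_adjacency(adjacency):
    """Render each subject as: s (p o) (p o) ..."""
    return [
        "%s %s" % (subject, " ".join("(%s %s)" % edge for edge in edges))
        for subject, edges in adjacency.items()
    ]


if __name__ == "__main__":
    sample = [
        "<s1> <p1> <o1> .",
        "<s1> <p2> <o2> .",
        "<s1> <p2> <o3> .",
        "<s2> <p1> <o1> .",
    ]
    for row in format_adjacency(group_by_subject(sample)):
        print(row)
```

In a real job the dictionary would be replaced by the shuffle phase, with
one reducer call per subject writing one adjacency-list line.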

But, as I said, it is too early for me to say definitively that this is
the best approach.

Paolo

Dan Brickley wrote:
> On 5 April 2012 05:49, Jakob Homan <jgho...@gmail.com> wrote:
>> Ack!, I suck.  Sorry.  I hadn't realized we'd gone through most of
>> them, which itself is a good thing.  I'll get some new ones added
>> first thing in the morning.  Sorry.
> 
> Do we have something around "document a workflow to get RDF graph data
> into Giraph?". A few of us have been talking about it here or there,
> and I've heard various strategies mentioned (e.g. Ntriples as it's a
> simple line-oriented format; piggybacking on HBase or other storage
> that Giraph already has adaptors for; integrating Apache Jena; ...). I
> can't find much in JIRA but
> https://issues.apache.org/jira/browse/GIRAPH-141 touches on the issue
> (since we can't currently easily represent fully general RDF graphs
> since two nodes might be connected by more than one typed edge). Even
> without multigraphs it ought to be possible to bring RDF-sourced data
> into Giraph, e.g. perhaps some app is only interested in say the
> Movies + People subset of a big RDF collection. And so perhaps most of
> the work is in preprocessing for now - e.g. via Ntriples + Pig; but
> still it would be great to have a clear HOWTO.
> 
> As an interested party on the periphery, a JIRA for this would give a
> natural place to monitor, read up, maybe even help. And I'm sure I'm
> not alone...
> 
> cheers,
> 
> Dan
