[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-24 Thread Benjamin Heitmann (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13260512#comment-13260512
 ] 

Benjamin Heitmann commented on GIRAPH-170:
--

Thanks guys for your comments!

Paolo: I will take a look at Jena RIOT for inferencing. 

Sebastian: I did not know that it is possible to assign one mapper to each core 
in Hadoop; I will try that for sure. Also, my algorithm only uses part of 
the graph when it runs, so that might be the simplest explanation for the 
observed behavior. 

Claudio: Thanks for the suggestion. I will investigate the issue further and 
provide an update when I know what's going on. 

 Workflow for loading RDF graph data into Giraph
 ---

 Key: GIRAPH-170
 URL: https://issues.apache.org/jira/browse/GIRAPH-170
 Project: Giraph
  Issue Type: New Feature
Reporter: Dan Brickley
Priority: Minor

 W3C RDF provides a family of Web standards for exchanging graph-based data. 
 RDF uses sets of simple binary relationships, labeling nodes and links with 
 Web identifiers (URIs). Many public datasets are available as RDF, including 
 the Linked Data cloud (see http://richard.cyganiak.de/2007/10/lod/ ). Many 
 such datasets are listed at http://thedatahub.org/
 RDF has several standard exchange syntaxes. The oldest is RDF/XML. A simple 
 line-oriented format is N-Triples. A format aligned with RDF's SPARQL query 
 language is Turtle. Apache Jena and Any23 provide software to handle all 
 these; http://incubator.apache.org/jena/ http://incubator.apache.org/any23/
 This JIRA leaves open the strategy for loading RDF data into Giraph. There 
 are various possibilities, including exploitation of intermediate 
 Hadoop-friendly stores, or pre-processing with e.g. Pig-based tools into a 
 more Giraph-friendly form, or writing custom loaders. Even a HOWTO document 
 or implementor notes here would be an advance on the current state of the 
 art. The BluePrints Graph API (Gremlin etc.) has also been aligned with 
 various RDF datasources.
 Related topics: multigraphs https://issues.apache.org/jira/browse/GIRAPH-141 
 touches on the issue (we can't currently easily represent fully general 
 RDF graphs, because two nodes might be connected by more than one typed edge). 
 Even without multigraphs it ought to be possible to bring RDF-sourced data
 into Giraph, e.g. perhaps some app is only interested in say the Movies + 
 People subset of a big RDF collection.
 From Avery in email: a helper VertexInputFormat (and maybe 
 VertexOutputFormat) would certainly [despite GIRAPH-141] still help

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-19 Thread Benjamin Heitmann (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13257426#comment-13257426
 ] 

Benjamin Heitmann commented on GIRAPH-170:
--

In addition, I would like to say that Paolo's suggestion of providing some 
ready-made code for Pig, HBase and MapReduce for processing RDF sounds like a 
really great contribution. 

Please keep us updated on the progress of that!





[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-19 Thread Benjamin Heitmann (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13257424#comment-13257424
 ] 

Benjamin Heitmann commented on GIRAPH-170:
--

Hello, 
and sorry for contributing to this discussion so late. 

I am currently using Giraph to implement a graph based recommendation algorithm 
which uses RDF data from DBPedia. I am not sure if that is enough of a use case 
for Paolo. 

Generally speaking, statistical analysis of semantic networks should be the 
most general motivation for using Giraph on RDF. In other words: Since RDF has 
a native graph database model and RDF processing needs to happen on web scale, 
Giraph could be a natural fit for processing RDF, if it would support RDF 
input/ingestion in a native way. 

Regarding the fundamental capabilities required for parsing N-Triples files 
containing RDF: the TextInputFormat needs a way to retrieve and alter already 
created nodes. Currently the TextInputFormat class assumes that it will 
get exactly one line for each vertex to create. That one line is assumed to 
hold *all* information necessary to create the vertex. 
However, the N-Triples format does not work that way, as it can use multiple 
lines to describe the same subject node. 

I already raised this issue on the user mailing list. (However, I did not create 
a Jira issue for it.) This is the fundamental capability which is lacking in 
Giraph. If this is enabled, parsing N-Triples will be easy. The starting points 
for the email threads in which this was briefly discussed are in [1] and [2].

AFAIR, Dionysis Logothetis suggested that he may look into adding this 
capability to Giraph. So you might want to contact him directly to check on the 
progress. 

Now a few details on how I use RDF data for my Giraph job: 
Currently I use a subset of DBPedia, which is roughly 5.5GB unpacked. 
As this DBPedia subset stays static for all my recommendations, it is enough to 
preprocess it once using a quite simple MapReduce job. I basically join all 
lines on the subject of the triple, and then output the following line for each 
subject: 
SubjectURI NumberOfOutLinks Predicate1 Object1 ... PredicateN ObjectN
(I call this the RDFAdjacencyCSV ;) 

For my specific algorithm, the direction of the link in the RDF graph does 
not play any role, so for each input triple, I add it once to the subject 
entity and once to the object entity. 
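
The join described above can be sketched in a few lines of plain Python. This 
is only an illustration of the line format, not the actual MapReduce job: the 
function names, the whitespace-based parsing of simple N-Triples lines, and 
the single-space field separator are all assumptions made for this example. 
Literal objects that contain spaces would need a different separator.

```python
from collections import defaultdict


def parse_ntriple(line):
    """Split one simple N-Triples line into (subject, predicate, object)."""
    body = line.strip()
    if body.endswith("."):
        body = body[:-1].rstrip()
    # Split on the first two whitespace runs only, so the object stays whole.
    subject, predicate, obj = body.split(None, 2)
    return subject, predicate, obj


def rdf_adjacency_lines(lines, undirected=True, sep=" "):
    """Yield 'Node NumberOfLinks Predicate1 Object1 ...' lines, one per node."""
    adjacency = defaultdict(list)
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        s, p, o = parse_ntriple(line)
        adjacency[s].append((p, o))
        if undirected:
            # Direction is ignored: also attach the triple to the object node.
            adjacency[o].append((p, s))
    for node in sorted(adjacency):
        links = adjacency[node]
        fields = [node, str(len(links))]
        for predicate, other in links:
            fields.extend([predicate, other])
        yield sep.join(fields)


if __name__ == "__main__":
    sample = ["<a> <p> <b> .", "<a> <q> <c> ."]
    for out in rdf_adjacency_lines(sample):
        print(out)
```

With undirected=False the sketch emits one line per subject only, which 
matches the plain subject join; the default mirrors the "once to the subject, 
once to the object" variant.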

The processing job took two days, but it was my first Hadoop program, so it 
probably was inefficient.
The output size was 6GB. 

For running my algorithm, my Giraph job first loads the complete DBPedia 
dataset in memory. While doing this it also loads the user profiles via 
DistributedCache.getLocalCacheFiles(conf). This is done in my own custom 
TextVertexInputFormat class. The profiles are used to prime the graph, i.e. to 
identify the starting points for the algorithm. I also need to manage which 
starting points belong to which user profiles.

Challenges which I will have in the near future: 
* Giraph does not seem to scale very well for my kind of data and processing: 
Independent of the number of workers, my Giraph job only uses about 30% of a 24 
node machine. And I would like to utilise all available processing resources.
* Integration of RDF reasoning capabilities: I will need to perform subclass 
reasoning on the DBPedia graph. The most pragmatic solution seems to be to 
have an external RDF store with reasoning, and to let the Giraph workers 
query the RDF store.


 
[1] 
https://mail-archives.apache.org/mod_mbox/incubator-giraph-user/201203.mbox/%3CE5D0BE74-7903-4145-BE10-52CBD6489AC8%40deri.org%3E
[2] 
https://mail-archives.apache.org/mod_mbox/incubator-giraph-user/201203.mbox/%3CC6DA4465-B387-474A-B823-84019967DA3E%40deri.org%3E


[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-19 Thread Benjamin Heitmann (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13257433#comment-13257433
 ] 

Benjamin Heitmann commented on GIRAPH-170:
--

Regarding GIRAPH-141, 
I don't think that true multigraph support is required for Giraph in order to 
use RDF data. 

If I have subject1 predicate1 object1 and subject1 predicate1 object2, then 
there will be a total of three vertices with 2 edges, without any conflict. If 
I have the same triple subject1 predicate1 object1 two or more times, then 
the RDF semantics document states that all of these triples refer to the same 
two vertices and the edge between them in the RDF graph. So there is no need 
for a multigraph again. 

If we introduce literals into the mix, then we have the same thing as above, if 
each literal is represented by its own Giraph vertex. 

I am not sure if I missed anything, but multigraphs don't seem to be the issue 
here, neither in theory nor for my already working code. 

A more important issue is the capability to retrieve and 
modify an already created node from inside the TextVertexInputFormat class (as 
explained above). 





Re: [jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-19 Thread Claudio Martella
As said in the other thread, you're missing the option subj1
predicate1 object1 and subj1 predicate2 object1.

That makes it a multi-graph.
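
The cases discussed in this exchange can be checked with a small sketch. It 
is plain Python rather than Giraph code, and the function name and the tuple 
representation of triples are assumptions made for illustration. Node pairs 
that are joined by more than one distinct predicate are the ones that need 
multigraph support (or some other encoding, such as a list-valued edge).

```python
from collections import defaultdict


def parallel_edges(triples):
    """Return {(subject, object): [predicates]} for node pairs that are
    connected by more than one distinct predicate."""
    predicates = defaultdict(set)
    for s, p, o in triples:
        # A set is used, so a triple that is repeated verbatim collapses
        # into a single edge, as in an RDF graph.
        predicates[(s, o)].add(p)
    return {pair: sorted(ps) for pair, ps in predicates.items() if len(ps) > 1}


if __name__ == "__main__":
    # Same predicate, different objects: no parallel edges.
    print(parallel_edges([("s1", "p1", "o1"), ("s1", "p1", "o2")]))
    # Same triple twice: still a single edge.
    print(parallel_edges([("s1", "p1", "o1"), ("s1", "p1", "o1")]))
    # Two predicates between the same pair of nodes: a multigraph.
    print(parallel_edges([("s1", "p1", "o1"), ("s1", "p2", "o1")]))
```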

On Thu, Apr 19, 2012 at 1:22 PM, Benjamin Heitmann (Commented) (JIRA)
j...@apache.org wrote:

    [ 
 https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13257433#comment-13257433
  ]

 Benjamin Heitmann commented on GIRAPH-170:
 --

 Regarding GIRAPH-141,
 I don't think that true multigraph support is required for Giraph in order to 
 use RDF data.

 If I have subject1 predicate1 object1 and subject1 predicate1 object2, 
 then there will be a total of three vertices with 2 edges, without any 
 conflict. If I have the same triple subject1 predicate1 object1 two or more 
 times, then the RDF semantics document states that all of these triples refer 
 to the same two vertices and the edge between them in the RDF graph. So there 
 is no need for a multigraph again.

 If we introduce literals into the mix, then we have the same thing as above, 
 if each literal is represented by its own Giraph vertex.

 I am not sure if I missed anything, but multigraphs don't seem to be the issue 
 here, neither in theory nor for my already working code.

 A more important issue is the capability to retrieve and 
 modify an already created node from inside the TextVertexInputFormat class 
 (as explained above).

-- 
   Claudio Martella
   claudio.marte...@gmail.com


[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-19 Thread Paolo Castagna (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13257695#comment-13257695
 ] 

Paolo Castagna commented on GIRAPH-170:
---

Hi Benjamin

 I call this the RDFAdjacencyCSV

We came to the same conclusion. I ended up using Turtle for this, as explained 
here: 
http://mail-archives.apache.org/mod_mbox/incubator-giraph-user/201204.mbox/%3C4F84872E.4050101%40googlemail.com%3E

Turtle isn't splittable in general, but it can be made so simply by writing all 
the RDF statements with the same subject on a single line.
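
As a rough illustration of the one-subject-per-line idea, the following 
Python sketch groups triples by subject and writes each group as a single 
Turtle statement using the ';' predicate-list separator. The function name is 
hypothetical, and the sketch assumes every term is already serialized as a 
valid Turtle term (for example <...> for URIs, and literals without raw line 
breaks).

```python
def turtle_lines(triples):
    """Yield one Turtle line per subject: 's p1 o1 ; p2 o2 .'"""
    grouped = {}
    for s, p, o in triples:
        grouped.setdefault(s, []).append((p, o))
    # dicts keep insertion order, so subjects appear in first-seen order.
    for s, pairs in grouped.items():
        yield s + " " + " ; ".join(p + " " + o for p, o in pairs) + " ."


if __name__ == "__main__":
    data = [
        ("<s1>", "<p1>", "<o1>"),
        ("<s2>", "<p1>", "<o1>"),
        ("<s1>", "<p2>", '"x y"'),
    ]
    for line in turtle_lines(data):
        print(line)
```

Because every line is then a complete statement, a line-oriented input split 
never cuts a subject's description in half.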

 I would like to say that Paolos suggestion of providing some ready made code 
 for Pig, HBase and MapReduce for processing RDF sounds like a really great 
 contribution. 

I am not sure what the best place to put such code is. I started by sharing 
small examples and experiments on GitHub, here: 
https://github.com/castagna/jena-grande

 Integration of RDF reasoning capabilities: I will need to perform subclass 
 reasoning on the DBPedia graph.

See Apache Jena's RIOT infer command or a MapReduce version of it, here: 
https://github.com/castagna/tdbloader4/blob/master/src/main/java/org/apache/jena/tdbloader4/InferDriver.java

I wonder if Giraph could be used to implement the RETE algorithm 
(http://en.wikipedia.org/wiki/Rete_algorithm) which is what Jena uses (with 
in-memory Jena RDF models).





[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-19 Thread Sebastian Schelter (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13257702#comment-13257702
 ] 

Sebastian Schelter commented on GIRAPH-170:
---

??Independent of the number of workers, my Giraph job only uses about 30% of a 
24 node machine. And I would like to utilise all available processing 
resources.??

It surprises me that you don't get a higher load. If you configure your 
cluster to use one worker/map instance per core you should get a much higher 
CPU load. Could it be that either the cluster is too powerful for your graph or 
that your algorithm doesn't work on the whole graph all the time?





[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-19 Thread Claudio Martella (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13257765#comment-13257765
 ] 

Claudio Martella commented on GIRAPH-170:
-

Difficult to say without further investigation (and without a proper definition 
of 30%) but it could also be that you're stuck in I/O (messaging, 
checkpointing).





[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-08 Thread Paolo Castagna (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13249601#comment-13249601
 ] 

Paolo Castagna commented on GIRAPH-170:
---

Pig and Pig Latin can certainly be used to create adjacency lists from RDF in 
N-Triples|N-Quads format.
I tend to use plain MapReduce jobs written in Java more, but I found a very old 
(i.e. it was using Pig version 0.6) example of how one might write an 
[NQuadsStorage|https://github.com/castagna/running-pig/blob/e4d12b377ee06f80be7e58d2af628028df9b2b07/src/main/java/com/talis/pig/NQuadsStorage.java]
 which implements LoadFunc and StoreFunc for Pig. I shared it, even though it 
does not even compile now, just to show how trivial that is.

It is my intention, in the next few weeks, to create a small library to support 
people wanting to use Pig, HBase, MapReduce and Giraph to process RDF data.
For Pig the first (and only?) thing to do is to implement LoadFunc and 
StoreFunc for RDF data. It seems possible (although not easy) to map the SPARQL 
algebra to Pig Latin physical operators (and SPARQL property paths to Giraph 
jobs? ;-)), that would provide a good and scalable batch processing solution 
for those into SPARQL. 
For HBase, the first step is to store RDF data, even a plain [(G)|S|P|O] 
solution would do initially.
For MapReduce: blank nodes can be painful (I have some tricks to share here), 
plus input/output formats, record readers/writers, etc.

In relation to Giraph, to bring the discussion on topic, until I am proven 
wrong, I am going for the adjacency list approach as discussed above and do 
graph processing as other 'usual' Giraph jobs.

The question of which RDF processing use cases are a good fit for Giraph is 
still open for me (and I'll find out soon).





[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-05 Thread Dan Brickley (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13247490#comment-13247490
 ] 

Dan Brickley commented on GIRAPH-170:
-

From Paolo in email:

 I suspect N-Triples | N-Quads might not be the best option for
something like Giraph. Something more like an adjacency list might be
better.

So my intuition is that, if you start with RDF in N-Triples format,
the first step would be a simple MapReduce job to group RDF statements
by subject (possibly filtering out certain properties):

Input:

 s1 --p1-- o1
 s1 --p2-- o2
 s1 --p2-- o3
 s2 ...

Output (adjacency list):

 s1 (p1 o1) (p2 o2) (p2 o3)
 s2 ...
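
The grouping step above can be sketched in a few lines of Python. This is only an illustration run in memory, not a MapReduce job and not anything from Giraph itself: the names group_by_subject, format_line and skip_predicates are made up for this note, and triples are assumed to be already parsed into (subject, predicate, object) tuples.

```python
from collections import defaultdict


def group_by_subject(triples, skip_predicates=frozenset()):
    """Group (subject, predicate, object) triples into an adjacency list.

    skip_predicates is a hypothetical filter: triples whose predicate is
    in this set are dropped before grouping.
    """
    adjacency = defaultdict(list)
    for s, p, o in triples:
        if p in skip_predicates:
            continue
        adjacency[s].append((p, o))
    return dict(adjacency)


def format_line(subject, edges):
    """Render one adjacency-list line in the 's1 (p1 o1) (p2 o2)' style."""
    return subject + "".join(f" ({p} {o})" for p, o in edges)


triples = [("s1", "p1", "o1"), ("s1", "p2", "o2"), ("s1", "p2", "o3")]
adjacency = group_by_subject(triples)
lines = [format_line(s, edges) for s, edges in adjacency.items()]
```

In a real job the subject would be the map output key and the (predicate, object) pair the value, with the reducer writing one line per subject.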





[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-05 Thread Dan Brickley (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13247506#comment-13247506
 ] 

Dan Brickley commented on GIRAPH-170:
-

Another architectural note around RDF:

RDF is basically simple factual data expressed as sets of binary relationships. 
In that sense it is a graph directly, already. 

However, RDF often describes something that is, in a deeper sense, also a graph. 
Common examples include FOAF, where node and edge types (Person, Document, 
Group, etc.) can express a matrix of collaboration, social linkage, etc. Or, from 
DBpedia.org, Freebase etc., we have for example datasets of movies and actors. 
In the DBpedia case, it's simple enough: a movie node, an actor node, and a 
typed link between them. Freebase, by contrast, reifies the 'starring' 
relationship into another node, so you can represent dates, character names, 
etc. This sort of meta-information (properties of links) is also, btw, in the 
BluePrints/Gremlin API.

One point here is that a 'starring' link pointing from a Movie to an Actor 
tells us the same thing, in reverse, as a 'starsIn' link from the Actor to 
the Movie. For Giraph we may therefore want to consider adding backlinks, so 
that each node is equally aware of properties pointing both in and out.
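
As a rough sketch of the backlink idea, assuming triples are plain (subject, predicate, object) tuples and that we have a table of inverse property names (the inverse_names mapping and the add_backlinks helper are hypothetical, invented for this note):

```python
def add_backlinks(triples, inverse_names):
    """Return the triples plus one reversed triple per known inverse.

    inverse_names maps a predicate to the name of its inverse, e.g.
    'starring' -> 'starsIn'. Predicates without an entry get no backlink.
    """
    result = list(triples)
    for s, p, o in triples:
        inverse = inverse_names.get(p)
        if inverse is not None:
            result.append((o, inverse, s))
    return result


movie_triples = [("movie1", "starring", "actor1")]
with_backlinks = add_backlinks(movie_triples, {"starring": "starsIn"})
```

Where no inverse property name exists, the same link could instead be stored at the target node with a direction flag.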






[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-05 Thread Paolo Castagna (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13247542#comment-13247542
 ] 

Paolo Castagna commented on GIRAPH-170:
---

bq. we may want to consider therefore adding backlinks

Yep. I'd like to better understand what people currently do if they need 
incoming and outgoing links for their processing.
An adjacency list can be constructed listing incoming (a.k.a. backlinks) as 
well as outgoing links, in one MapReduce job.

Input:

s1 -p1- o1
s1 -p2- o2
s1 -p2- o3
s2 -p1- s1
s2 ...

Output (adjacency list):

s1 (out: p1 o1) (out: p2 o2) (out: p2 o3) (in: s2 p1)
s2 ...

Whether it is better to do it this way or have support from the Giraph APIs 
avoiding an initial MapReduce job to construct the adjacency list, I do not 
know yet.
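
For what it's worth, the single-job construction described above can be simulated in memory like this. It is a sketch only: map_triple, reduce_node and run_job are hypothetical names, the sort stands in for the Hadoop shuffle, and no Hadoop or Giraph API is used.

```python
from itertools import groupby
from operator import itemgetter


def map_triple(s, p, o):
    """Map step: emit the triple once under each endpoint."""
    yield s, ("out", p, o)   # outgoing link, keyed by subject
    yield o, ("in", s, p)    # incoming link (backlink), keyed by object


def reduce_node(node, records):
    """Reduce step: join all records for one node into one line."""
    return node + "".join(f" ({d}: {a} {b})" for d, a, b in records)


def run_job(triples):
    """Run map, a sort that stands in for the shuffle, then reduce."""
    emitted = [kv for t in triples for kv in map_triple(*t)]
    emitted.sort(key=itemgetter(0))  # stable, so per-node order is kept
    return [
        reduce_node(node, [record for _, record in group])
        for node, group in groupby(emitted, key=itemgetter(0))
    ]


triples = [
    ("s1", "p1", "o1"),
    ("s1", "p2", "o2"),
    ("s1", "p2", "o3"),
    ("s2", "p1", "s1"),
]
lines = run_job(triples)
```

Note that every object, including o1, o2 and o3, also gets a line of its own with its incoming links; whether literal values should become vertices is a separate decision.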
