[jira] [Commented] (GIRAPH-180) Publish SNAPSHOTs and released artifacts in the Maven repository

2012-04-17 Thread Benjamin Heitmann (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255465#comment-13255465
 ] 

Benjamin Heitmann commented on GIRAPH-180:
--

+1

That would make it easier for new users, and for general management of builds 
via Maven. 

In addition, the Maven repository which will be used for publishing Giraph 
artefacts can also be used for publishing Maven archetypes (basically 
project-specific skeleton code without any functionality). 

I was planning on making an initial version of a Maven archetype for a small 
Giraph job sometime in the next few weeks. 

> Publish SNAPSHOTs and released artifacts in the Maven repository
> 
>
> Key: GIRAPH-180
> URL: https://issues.apache.org/jira/browse/GIRAPH-180
> Project: Giraph
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.1.0
>Reporter: Paolo Castagna
>Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Currently Giraph uses Maven to drive its build.
> However, no Maven artifacts nor SNAPSHOTs are published in the Apache Maven 
> repository or Maven central.
> It would be useful to have Apache Giraph artifacts and SNAPSHOTs published 
> and enable people to use Giraph without recompiling themselves.
> Right now users can checkout Giraph, mvn install it and use this for their 
> dependency:
> 
> <dependency>
>   <groupId>org.apache.giraph</groupId>
>   <artifactId>giraph</artifactId>
>   <version>0.2-SNAPSHOT</version>
> </dependency>
> 
> So, it's not that bad, but it can be better. :-)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (GIRAPH-180) Publish SNAPSHOTs and released artifacts in the Maven repository

2012-04-17 Thread Benjamin Heitmann (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255466#comment-13255466
 ] 

Benjamin Heitmann commented on GIRAPH-180:
--

The general process for publishing a Maven archetype (via a catalog file) is 
outlined here: 
http://www.sonatype.com/books/mvnref-book/reference/archetype-sect-publishing.html





[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-19 Thread Benjamin Heitmann (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257426#comment-13257426
 ] 

Benjamin Heitmann commented on GIRAPH-170:
--

In addition,
I would like to say that Paolo's suggestion of providing some ready-made code 
for Pig, HBase and MapReduce for processing RDF sounds like a really great 
contribution. 

Please keep us updated on the progress of that!

> Workflow for loading RDF graph data into Giraph
> ---
>
> Key: GIRAPH-170
> URL: https://issues.apache.org/jira/browse/GIRAPH-170
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Dan Brickley
>Priority: Minor
>
> W3C RDF provides a family of Web standards for exchanging graph-based data. 
> RDF uses sets of simple binary relationships, labeling nodes and links with 
> Web identifiers (URIs). Many public datasets are available as RDF, including 
> the "Linked Data" cloud (see http://richard.cyganiak.de/2007/10/lod/ ). Many 
> such datasets are listed at http://thedatahub.org/
> RDF has several standard exchange syntaxes. The oldest is RDF/XML. A simple 
> line-oriented format is N-Triples. A format aligned with RDF's SPARQL query 
> language is Turtle. Apache Jena and Any23 provide software to handle all 
> these; http://incubator.apache.org/jena/ http://incubator.apache.org/any23/
> This JIRA leaves open the strategy for loading RDF data into Giraph. There 
> are various possibilities, including exploitation of intermediate 
> Hadoop-friendly stores, or pre-processing with e.g. Pig-based tools into a 
> more Giraph-friendly form, or writing custom loaders. Even a HOWTO document 
> or implementor notes here would be an advance on the current state of the 
> art. The BluePrints Graph API (Gremlin etc.) has also been aligned with 
> various RDF datasources.
> Related topics: multigraphs https://issues.apache.org/jira/browse/GIRAPH-141 
> touches on the issue (since we can't currently easily represent fully general 
> RDF graphs since two nodes might be connected by more than one typed edge). 
> Even without multigraphs it ought to be possible to bring RDF-sourced data
> into Giraph, e.g. perhaps some app is only interested in say the Movies + 
> People subset of a big RDF collection.
> From Avery in email: "a helper VertexInputFormat (and maybe 
> VertexOutputFormat) would certainly [despite GIRAPH-141] still help"





[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-19 Thread Benjamin Heitmann (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257424#comment-13257424
 ] 

Benjamin Heitmann commented on GIRAPH-170:
--

Hello, 
and sorry for contributing to this discussion so late. 

I am currently using Giraph to implement a graph-based recommendation algorithm 
which uses RDF data from DBPedia. I am not sure if that is enough of a use case 
for Paolo. 

Generally speaking, statistical analysis of semantic networks should be the 
most general motivation for using Giraph on RDF. In other words: since RDF has 
a native graph database model and RDF processing needs to happen at web scale, 
Giraph could be a natural fit for processing RDF, if it supported RDF 
input/ingestion in a native way. 

Regarding the fundamental capabilities required for parsing N-Triples files 
containing RDF: the TextInputFormat needs a way to retrieve and alter already 
created nodes. Currently the assumption for the TextInputFormat class is that 
it will get exactly one line for each vertex to create. That one line is 
assumed to hold *all* information necessary to create the vertex. 
However, the N-Triples format does not work that way, as it can use multiple 
lines to describe the same subject node. 
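
To make the mismatch concrete, here is a small sketch in plain Python (not 
Giraph code). The helper names and the example.org URIs are made up for this 
illustration, and the parsing assumes simple triples whose subject contains no 
whitespace. It mimics a reader that creates one vertex per input line and 
shows that a subject described on two lines would be created twice: 

```python
from collections import Counter


def subject_of(ntriples_line):
    """Return the subject term, i.e. the first whitespace-separated token."""
    return ntriples_line.split(None, 1)[0]


def naive_vertex_ids(lines):
    """Mimic a one-line-per-vertex reader: each non-empty line yields one vertex ID."""
    return [subject_of(line) for line in lines if line.strip()]


# Hypothetical input: the first two lines describe the same subject.
lines = [
    "<http://example.org/s1> <http://example.org/p1> <http://example.org/o1> .",
    "<http://example.org/s1> <http://example.org/p2> <http://example.org/o2> .",
    "<http://example.org/s2> <http://example.org/p1> <http://example.org/o1> .",
]

ids = naive_vertex_ids(lines)
# Subjects that the naive reader would try to create more than once.
duplicates = {vertex: n for vertex, n in Counter(ids).items() if n > 1}
print(duplicates)
```

Without a way to look up and extend the vertex created for the first line, 
the second line cannot simply add its edge to it. 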

I already raised this issue on the user mailing list. (However, I did not 
create a Jira issue for it.) This is the fundamental capability which is 
lacking in Giraph. If this is enabled, parsing N-Triples will be easy. The 
starting points for the email threads in which this was briefly discussed are 
in [1] and [2].

AFAIR, Dionysis Logothetis suggested that he might look into adding this 
capability to Giraph. So you might want to contact him directly to check on the 
progress. 

Now a few details on how I use RDF data for my Giraph job: 
Currently I use a subset of DBPedia, which is roughly 5.5GB unpacked. 
As this DBPedia subset stays static for all my recommendations, it is enough to 
preprocess it once
using a quite simple MapReduce job. I basically join all lines on the subject 
of the triple, 
and then output the following line for each subject: 
SubjectURI NumberOfOutLinks Predicate1 Object1 ... PredicateN ObjectN
(I call this the RDFAdjacencyCSV ;) 

For my specific algorithm, the direction of the link in the RDF graph does 
not play any role, so for each input triple, I add it once to the subject 
entity and once to the object entity. 
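
The preprocessing described above can be sketched roughly as follows. This is 
plain in-memory Python, not the actual MapReduce job: the dictionary stands in 
for the shuffle that groups triples by subject. The function names, the single 
space as field separator, the sorted output order and the short `<s1>`-style 
terms are assumptions made for this illustration, and the parser only handles 
triples whose terms contain no whitespace. 

```python
from collections import defaultdict


def parse_triple(line):
    """Split a simple N-Triples line into (subject, predicate, object)."""
    subject, predicate, obj = line.rstrip().rstrip(".").split()
    return subject, predicate, obj


def adjacency_lines(triples, undirected=False):
    """Join triples on their subject and emit one output line per subject.

    Line layout: SubjectURI NumberOfOutLinks Predicate1 Object1 ... PredicateN ObjectN
    With undirected=True each triple is also added to its object entity.
    """
    adjacency = defaultdict(list)
    for subject, predicate, obj in triples:
        adjacency[subject].append((predicate, obj))
        if undirected:
            adjacency[obj].append((predicate, subject))

    output = []
    for entity in sorted(adjacency):
        links = adjacency[entity]
        fields = [entity, str(len(links))]
        for predicate, target in links:
            fields.extend([predicate, target])
        output.append(" ".join(fields))
    return output


raw = [
    "<s1> <p1> <o1> .",
    "<s1> <p2> <o2> .",
    "<s2> <p1> <o1> .",
]
triples = [parse_triple(line) for line in raw]

directed = adjacency_lines(triples)
both_ways = adjacency_lines(triples, undirected=True)
for line in both_ways:
    print(line)
```

In the directed case only `<s1>` and `<s2>` get a line. With 
`undirected=True`, `<o1>` and `<o2>` get lines as well, which is why the 
output ends up larger than the input. 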

The processing job took two days, but it was my first Hadoop program, so it 
probably was inefficient.
The output size was 6GB. 

For running my algorithm, my Giraph job first loads the complete DBPedia 
dataset into memory. While doing this it also loads the user profiles via 
DistributedCache.getLocalCacheFiles(conf). This is done in my own custom 
TextVertexInputFormat class. The profiles are used to prime the graph, i.e. to 
identify the starting points for the algorithm. I also need to manage which 
starting points belong to which user profiles.

Challenges which I will have in the near future: 
* Giraph does not seem to scale very well for my kind of data and processing: 
Independent of the number of workers, my Giraph job only uses about 30% of a 24 
node machine. And I would like to utilise all available processing resources.
* Integration of RDF reasoning capabilities: I will need to perform subclass 
reasoning on the DBPedia graph. The most pragmatic solution seems to be to 
have an external RDF store with reasoning, and to let the Giraph workers 
query the RDF store.


 
[1] 
https://mail-archives.apache.org/mod_mbox/incubator-giraph-user/201203.mbox/%3CE5D0BE74-7903-4145-BE10-52CBD6489AC8%40deri.org%3E
[2] 
https://mail-archives.apache.org/mod_mbox/incubator-giraph-user/201203.mbox/%3CC6DA4465-B387-474A-B823-84019967DA3E%40deri.org%3E


[jira] [Commented] (GIRAPH-170) Workflow for loading RDF graph data into Giraph

2012-04-19 Thread Benjamin Heitmann (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/GIRAPH-170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13257433#comment-13257433
 ] 

Benjamin Heitmann commented on GIRAPH-170:
--

Regarding GIRAPH-141, 
I don't think that true multigraph support is required for Giraph in order to 
use RDF data. 

If I have "subject1 predicate1 object1" and "subject1 predicate1 object2", then 
there will be a total of three vertices with two edges, without any conflict. 
If I have the same triple "subject1 predicate1 object1" two or more times, then 
the RDF semantics document states that all of these triples refer to the same 
two vertices and the edge between them in the RDF graph. So again, there is no 
need for a multigraph. 
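
As a quick check of this counting argument, here is a sketch in plain Python 
(the function name is made up for this illustration). It treats an RDF graph 
as a set of triples, so a repeated triple collapses into a single edge: 

```python
def rdf_graph(triples):
    """Build (vertices, edges) from triples, treating the graph as a set of triples."""
    edges = set(triples)  # identical triples collapse into one edge
    vertices = {s for s, _, _ in edges} | {o for _, _, o in edges}
    return vertices, edges


triples = [
    ("subject1", "predicate1", "object1"),
    ("subject1", "predicate1", "object2"),
    ("subject1", "predicate1", "object1"),  # same triple stated a second time
]

vertices, edges = rdf_graph(triples)
print(len(vertices), len(edges))  # prints "3 2"
```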

If we introduce literals into the mix, then we have the same thing as above, if 
each literal is represented by its own Giraph vertex. 

I am not sure if I missed anything, but multigraphs don't seem to be the issue 
here, neither in theory nor for my already working code. 

An issue which would be more important is the capability to retrieve and 
modify an already created node from inside the TextVertexInputFormat class (as 
explained above). 
