Re: Merging graphs

A. Soroka Wed, 21 Dec 2016 06:47:36 -0800

Sure, that makes sense. And as Andy says, a union won't copy data.
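
To illustrate the difference between a union view and a copying merge, here is a minimal plain-Java sketch. It is not Jena code and not how Jena implements it: UnionView, UnionViewSketch and triples() are names made up for this example, and triples are modelled as plain strings. It only shows the idea that a union keeps references to the original sets, while a merge copies, and that iterating the union has to remember what it has already returned in order to behave like a set.

```java
import java.util.AbstractSet;
import java.util.HashSet;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Set;

public class UnionViewSketch {

    /** Hypothetical read-only union view: holds references to both sets, copies nothing. */
    static final class UnionView<T> extends AbstractSet<T> {
        private final Set<T> left;
        private final Set<T> right;

        UnionView(Set<T> left, Set<T> right) {
            this.left = left;
            this.right = right;
        }

        @Override
        public boolean contains(Object o) {
            // A lookup just asks each underlying set in turn.
            return left.contains(o) || right.contains(o);
        }

        @Override
        public Iterator<T> iterator() {
            // To look like a set, iteration must skip anything already returned,
            // so it has to remember what it has seen. This is the costly part.
            final Set<T> seen = new HashSet<>();
            final Iterator<T> a = left.iterator();
            final Iterator<T> b = right.iterator();
            return new Iterator<T>() {
                private T next;
                private boolean ready;

                @Override
                public boolean hasNext() {
                    while (!ready) {
                        Iterator<T> src = a.hasNext() ? a : (b.hasNext() ? b : null);
                        if (src == null) {
                            return false;
                        }
                        T candidate = src.next();
                        if (seen.add(candidate)) {
                            next = candidate;
                            ready = true;
                        }
                    }
                    return true;
                }

                @Override
                public T next() {
                    if (!hasNext()) {
                        throw new NoSuchElementException();
                    }
                    ready = false;
                    return next;
                }
            };
        }

        @Override
        public int size() {
            // Counting distinct elements requires a full de-duplicating pass.
            int n = 0;
            for (Iterator<T> it = iterator(); it.hasNext(); it.next()) {
                n++;
            }
            return n;
        }
    }

    /** Builds a toy "graph" of string triples s<from> .. s<to-1>. */
    static Set<String> triples(int from, int to) {
        Set<String> g = new HashSet<>();
        for (int i = from; i < to; i++) {
            g.add("s" + i + " p o");
        }
        return g;
    }

    public static void main(String[] args) {
        Set<String> g1 = triples(0, 50);   // 50 triples
        Set<String> g2 = triples(25, 75);  // 50 triples, 25 shared with g1

        Set<String> union = new UnionView<>(g1, g2); // view, no copy
        Set<String> merged = new HashSet<>(g1);      // true merge: copies
        merged.addAll(g2);

        System.out.println(union.size());   // 75
        System.out.println(merged.size());  // 75

        g2.add("s99 p o");                  // change one of the originals
        System.out.println(union.contains("s99 p o"));  // true: the view sees it
        System.out.println(merged.contains("s99 p o")); // false: the copy does not
    }
}
```

Both give the same 75 distinct triples, but only the merge stores them a second time; the union pays instead at iteration time, when it tracks what it has already handed out.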

---
A. Soroka
The University of Virginia Library


> On Dec 21, 2016, at 9:43 AM, George News <george.n...@gmx.net> wrote:
> 
> On 21/12/2016 14:17, A. Soroka wrote:
>> DatasetGraph/Graph implementations are smart enough not to store
>> duplicate tuples. So adding (let's say) a graph with 50 triples to a
>> graph with 50 triples, of which 25 are common between the two, should
>> result in a graph with 75 triples to be searched. On the other hand,
>> a union graph between the two will have to search 100 triples. Is
>> that what you mean?
> 
> No. The graphs I'm merging will probably have fewer than 1% of their
> triples duplicated (if the users do things as expected).
> 
> The issue is that I want to merge only for some SPARQL searches.
> Therefore I don't want a deep copy of the graphs; I just want to use the
> original graphs as one. Let's say it's like pointers in C: I just
> want to keep the original graphs' pointers so no data is replicated.
> 
> For that, based on A. Seaborne's answer, it seems that createUnion will
> do it. I don't know what is usual, but let's say I will have to deal
> with several thousand triples, which is why I don't want to copy
> them again and again.
> 
> Hope the explanation is ok.
> 
> Regards,
> Jorge
> 
>> --- A. Soroka The University of Virginia Library
>> 
>>> On Dec 21, 2016, at 8:13 AM, George News <george.n...@gmx.net>
>>> wrote:
>>> 
>>> 
>>> On 21/12/2016 13:54, Andy Seaborne wrote:
>>>> 
>>>> 
>>>> On 21/12/16 12:31, George News wrote:
>>>>> Hi,
>>>>> 
>>>>> Today is the day of questions to the mailing list ;) Sorry for
>>>>> the "spam" ;)
>>>>> 
>>>>> I would like to know what is the internal implementation of
>>>>> the functions used for merging graphs.
>>>>> 
>>>>> 1) ModelFactory.createUnion(Model m1, Model m2) It seems from
>>>>> what I have read and inferred from some websites that there is
>>>>> not an actual copy of data on a new graph. It is more that 
>>>>> internally the graph pointers (like in C) are linked, but the
>>>>> data is the original one and not copied. Is that right?
>>>> 
>>>> Correct - it is a new model that internally provides the union
>>>> view of two other models.
>>> 
>>> Great, no copy then ;)
>>> 
>>>>> 
>>>>> 2) org.apache.jena.graph.compose.MultiUnion How does
>>>>> addGraph() work? Does it copy the original graph or does it
>>>>> just link the data? I'm confused by the help: "Note that
>>>>> the requirement to remove duplicates from the union means that
>>>>> this will be an expensive operation for large (and especially
>>>>> for persistent) graphs."
>>>> 
>>>> That comment is on find()
>>> 
>>> Oops, my fault. You are completely right :(
>>> 
>>>> A graph is a set of triples - the key here is "set" - only one
>>>> instance.
>>>> 
>>>> To make that appear to be true in the union, the code needs to
>>>> remember what it has iterated over. If it is doing (in the extreme)
>>>> find(null,null,null), that's a lot of space.
>>>> 
>>>> 
>>>> 
>>>>> Besides, how do I retrieve the merged/joint graph? Do I have
>>>>> to use option 1) iteratively, reusing the returned
>>>>> graph to add the next one?
>>>> 
>>>> add(Model) copies the one model into another - a true merge.
>>> 
>>> That was what I thought. Now the confirmation from experts ;)
>>> 
>>>> 
>>>> From your previous question, you don't want this - you want
>>>> TDB's "default union graph" mode.  It's a lot cheaper at scale.
>>>> 
>>>> https://jena.apache.org/documentation/tdb/datasets.html
>>> 
>>> I already have that for the whole dataset. However, I was thinking
>>> of creating smaller named graphs. In my mind, this is going to make
>>> SPARQL queries and calls to the Jena API quicker, as the amount of
>>> data to search is smaller. Is this right?
>>> 
>>> If it is, I was thinking, based also on your response, of creating a
>>> Model that is the union of all the ones I want (which should be
>>> quick), and then using this Model as the input for the SPARQL engine.
>>> 
>>> Besides, I was also thinking of having multiple datasets (TDB), but
>>> I don't know if that would make any sense.
>>> 
>>> The issue is that the amount of data that I will have to handle is
>>> quite huge, and I want, as much as possible, to make the searchable
>>> sets as small as possible.
>>> 
>>>>> 
>>>>> Thanks in advance for the help. Jorge
>>>>> 
>>>> 
>> 
>>
