Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-04-01 Thread Dick Murray
Hi.

I've pushed up a draft to https://github.com/dick-twocows/jena-dev.git.

This has two test cases:

Echo: echoes back the find(G, S, P, O) call, i.e. call find(A, B, C, D) and
you get the quad (A, B, C, D) back. This does not cache between calls.

CSV: transforms a CSV file into quads, i.e. find(G, S, P, O) will open the
CSV by mangling the G and cache against (G, ANY, ANY, ANY). This does cache
between calls, i.e. the CSV is transformed once.
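The CSV idea can be sketched with plain JDK collections: the graph name keys a CSV source, and the first find for that graph transforms every row into quad-like records and caches them. This is a toy illustration under assumed names (`CsvQuads`, string "quads"), not the code in the linked repository:

```java
import java.util.*;

// Toy sketch: a graph name identifies a CSV source; the first find for
// that graph turns each row into one "quad" per cell and caches the
// result, so the file is transformed only once.
public class CsvQuads {
    // graph name -> cached quads, filled lazily on first find
    private final Map<String, List<String[]>> cache = new HashMap<>();
    private final Map<String, List<String>> csvSource;

    public CsvQuads(Map<String, List<String>> csvSource) {
        this.csvSource = csvSource;
    }

    public List<String[]> find(String graph) {
        return cache.computeIfAbsent(graph, g -> {
            List<String[]> quads = new ArrayList<>();
            List<String> lines = csvSource.getOrDefault(g, List.of());
            String[] header = lines.isEmpty() ? new String[0] : lines.get(0).split(",");
            for (int r = 1; r < lines.size(); r++) {
                String[] cells = lines.get(r).split(",");
                // one quad per cell: (graph, row subject, column predicate, value)
                for (int c = 0; c < cells.length; c++) {
                    quads.add(new String[] { g, "row" + r, header[c], cells[c] });
                }
            }
            return quads;
        });
    }
}
```

A second find for the same graph returns the cached list, so the transformation happens once, mirroring the caching-between-calls behaviour described above.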

Will look at a simple JDBC test over the weekend if I get the time...

It has a POM, so it should build with Maven.

Comments appreciated (I've probably hard coded something).

Dick.

On 30 March 2016 at 20:39, Andy Seaborne  wrote:

> On 29/03/16 12:23, Joint wrote:
>
>>
>>
>> Yep, that's mangled. I've refactored the code into a Jena package. Do
>> you want me to create a patch for testing, or can it be pulled from my
>> GitHub?
>>
>>
>> Dick
>>
>
> 
> One of the things any open source project has to manage is whether
> accepting a contribution is the right thing to do - factors like who will
> maintain it come in.  Sometimes it is better to have a module, sometimes it
> is better to have a related project. Jena has kept lists of related projects
> before - they get out-of-date as nobody wants to remove a live, albeit
> quiet, project.
>
>
> So there are two steps - understand what the code does and then whether the
> right thing to do is incorporate it.
>
> Post the github URL and we can look.
>
> For a contribution it is better if it is pushed to the project in some way
> (e.g. patch on JIRA, github PR) even if Apache Licensed.  Community over
> code.
>
> Andy
>


Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-29 Thread Joint


Yep, that's mangled.
I've refactored the code into a Jena package. Do you want me to create a patch
for testing, or can it be pulled from my GitHub?


Dick

 Original message 
From: Andy Seaborne <a...@apache.org> 
Date: 29/03/2016  10:02 am  (GMT+00:00) 
To: users@jena.apache.org 
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
  DatasetGraphInMemory 

On 16/03/16 20:05, Dick Murray wrote:
> Right, I think I cracked it! :-)
>
> Two classes defined below, one extends DatasetGraphInMemory, one provides a
> small test (basically a quad echo).
>
> Simple overview;
>
> addToNamedGraph writes the quad into a separate QuadTable if the
> transaction is READ otherwise it calls super...
>
> findInNamedGraph returns the super find if the transaction is READ
> otherwise it returns the union of the super find and the separate
> QuadTable.find.
>
> end checks if the transaction is READ and separate quad tables exist and if
> they do it begins a WRITE transaction and copies the quads then updates a
> global set of cached quads.
>
> I'm sure this upholds read committed for threads holding READ and also for
> the thread holding the READ which needs to WRITE because of the union.
> Subsequent threads which READ will see the changes of the WRITE after it
> has been committed. The delayed WRITE in the end() will proceed as normal
> WRITE blocking the READ thread from continuing.
>
> Comments please?

Sounds OK.

But the code got mangled :-(

[
Thunderbird-ism? That seems to remove indentation on C - it's annoying
and I haven't found a way to stop it
]

andy

>
> The following class will create the quad it's asked to find the first time
> it is asked to find it.
>
> package org.iungo.dataset;
>
> import org.apache.jena.sparql.core.Quad;
>
> public class DatasetGraphEcho extends DatasetGraphOnDemand {
>
> public DatasetGraphEcho() {
> onDemand.add(new CacheOnDemand() {
> @Override
> public void cache(Quad t) {
> add(t);
> }
> });
> }
> }
>
>
> By extending this class...
>
> package org.iungo.dataset;
>
> import java.util.HashMap;
> import java.util.HashSet;
> import java.util.Iterator;
> import java.util.LinkedList;
> import java.util.List;
> import java.util.Map;
> import java.util.Set;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.function.BiConsumer;
> import java.util.function.Consumer;
>
> import org.apache.jena.graph.Node;
> import org.apache.jena.graph.compose.CompositionBase;
> import org.apache.jena.query.ReadWrite;
> import org.apache.jena.sparql.core.Quad;
> import org.apache.jena.sparql.core.mem.DatasetGraphInMemory;
> import org.apache.jena.sparql.core.mem.HexTable;
> import org.apache.jena.sparql.core.mem.QuadTable;
> import org.apache.jena.util.iterator.WrappedIterator;
>
> public abstract class DatasetGraphOnDemand extends DatasetGraphInMemory {
>
> protected static class DelayedWrite {
> protected final QuadTable quadTable = new HexTable();
> }
> protected final ThreadLocal<Map<Quad, DelayedWrite>> delayedWrites =
> ThreadLocal.withInitial(() -> new HashMap<>());
> protected static interface OnDemand extends Consumer<Quad> {
> }
>
> protected Set<Quad> cached = ConcurrentHashMap.newKeySet();
> protected abstract class CacheOnDemand implements OnDemand {
> @Override
> public void accept(Quad q) {
> if (!cached.contains(q)) {
> cache(q);
> }
> }
> abstract void cache(Quad q);
> }
> protected final List<OnDemand> onDemand = new LinkedList<>();
> @Override
> protected void addToDftGraph(Node s, Node p, Node o) {
> throw new UnsupportedOperationException();
> }
>
> @Override
> protected void addToNamedGraph(Node g, Node s, Node p, Node o) {
> if (transactionType().equals(ReadWrite.READ)) {
> Map<Quad, DelayedWrite> m = delayedWrites.get();
> Quad q = new Quad(g, s, p, o);
> DelayedWrite delayedWrite = m.get(q);
> if (delayedWrite == null) {
> delayedWrite = new DelayedWrite();
> m.put(q, delayedWrite);
> }
> delayedWrite.quadTable.add(q);
> } else {
> super.addToNamedGraph(g, s, p, o);
> }
> }
>
> @Override
> protected Iterator<Quad> findInSpecificNamedGraph(Node g, Node s, Node p,
> Node o) {
> final Quad q = new Quad(g, s, p, o);
> onDemand.forEach(u -> u.accept(q));
> final Iterator<Quad> i = super.findInSpecificNamedGraph(g, s, p, o);
> if (transactionType().equals(ReadWrite.READ)) {
> DelayedWrite delayedWrite = delayedWrites.get().get(q);
> if (delayedWrite == null) {
> return i;
> } else {
> Set<Quad> seen = new HashSet<>();
> /*
> * Return the read quads and then the delayedWrite quads dropping any which
> have already been seen so as to preserve the union contract.

Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-29 Thread Andy Seaborne

On 16/03/16 20:05, Dick Murray wrote:

Right, I think I cracked it! :-)

Two classes defined below, one extends DatasetGraphInMemory, one provides a
small test (basically a quad echo).

Simple overview;

addToNamedGraph writes the quad into a separate QuadTable if the
transaction is READ otherwise it calls super...

findInNamedGraph returns the super find if the transaction is READ
otherwise it returns the union of the super find and the separate
QuadTable.find.

end checks if the transaction is READ and separate quad tables exist and if
they do it begins a WRITE transaction and copies the quads then updates a
global set of cached quads.

I'm sure this upholds read committed for threads holding READ and also for
the thread holding the READ which needs to WRITE because of the union.
Subsequent threads which READ will see the changes of the WRITE after it
has been committed. The delayed WRITE in the end() will proceed as normal
WRITE blocking the READ thread from continuing.

Comments please?


Sounds OK.

But the code got mangled :-(

[
Thunderbird-ism? That seems to remove indentation on C - it's annoying
and I haven't found a way to stop it

]

andy



The following class will create the quad it's asked to find the first time
it is asked to find it.

package org.iungo.dataset;

import org.apache.jena.sparql.core.Quad;

public class DatasetGraphEcho extends DatasetGraphOnDemand {

public DatasetGraphEcho() {
onDemand.add(new CacheOnDemand() {
@Override
public void cache(Quad t) {
add(t);
}
});
}
}


By extending this class...

package org.iungo.dataset;

import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

import org.apache.jena.graph.Node;
import org.apache.jena.graph.compose.CompositionBase;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.sparql.core.Quad;
import org.apache.jena.sparql.core.mem.DatasetGraphInMemory;
import org.apache.jena.sparql.core.mem.HexTable;
import org.apache.jena.sparql.core.mem.QuadTable;
import org.apache.jena.util.iterator.WrappedIterator;

public abstract class DatasetGraphOnDemand extends DatasetGraphInMemory {

protected static class DelayedWrite {
protected final QuadTable quadTable = new HexTable();
}
protected final ThreadLocal<Map<Quad, DelayedWrite>> delayedWrites =
ThreadLocal.withInitial(() -> new HashMap<>());
protected static interface OnDemand extends Consumer<Quad> {
}

protected Set<Quad> cached = ConcurrentHashMap.newKeySet();
protected abstract class CacheOnDemand implements OnDemand {
@Override
public void accept(Quad q) {
if (!cached.contains(q)) {
cache(q);
}
}
abstract void cache(Quad q);
}
protected final List<OnDemand> onDemand = new LinkedList<>();
@Override
protected void addToDftGraph(Node s, Node p, Node o) {
throw new UnsupportedOperationException();
}

@Override
protected void addToNamedGraph(Node g, Node s, Node p, Node o) {
if (transactionType().equals(ReadWrite.READ)) {
Map<Quad, DelayedWrite> m = delayedWrites.get();
Quad q = new Quad(g, s, p, o);
DelayedWrite delayedWrite = m.get(q);
if (delayedWrite == null) {
delayedWrite = new DelayedWrite();
m.put(q, delayedWrite);
}
delayedWrite.quadTable.add(q);
} else {
super.addToNamedGraph(g, s, p, o);
}
}

@Override
protected Iterator<Quad> findInSpecificNamedGraph(Node g, Node s, Node p,
Node o) {
final Quad q = new Quad(g, s, p, o);
onDemand.forEach(u -> u.accept(q));
final Iterator<Quad> i = super.findInSpecificNamedGraph(g, s, p, o);
if (transactionType().equals(ReadWrite.READ)) {
DelayedWrite delayedWrite = delayedWrites.get().get(q);
if (delayedWrite == null) {
return i;
} else {
Set<Quad> seen = new HashSet<>();
/*
* Return the read quads and then the delayedWrite quads dropping any which
have already been seen so as to preserve the union contract.
*/
return CompositionBase.recording(WrappedIterator.create(i), seen).andThen(
WrappedIterator.create(delayedWrite.quadTable.find(g, s, p,
o).iterator()).filterDrop( seen::contains ));
}
} else {
return i;
}
}

@Override
public void end() {
final Boolean applyDelayedWrites = transactionType().equals(ReadWrite.READ)
&& delayedWrites.get().size() > 0;
super.end();
if (applyDelayedWrites) {
begin(ReadWrite.WRITE);
try {
Map<Quad, DelayedWrite> m = delayedWrites.get();
m.forEach(new BiConsumer<Quad, DelayedWrite>() {
@Override
public void accept(Quad t, DelayedWrite u) {
u.quadTable.begin(ReadWrite.READ);
u.quadTable.find(Node.ANY, Node.ANY, Node.ANY, Node.ANY).forEach(quad ->
add(quad));
u.quadTable.end();
}
});
commit(); // All the delay writes have now been committed to the dataset
graph.
// Add all the cached quads to the global cache
m.keySet().forEach(q -> cached.add(q));
} finally {
super.end();
}
}
}
}
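The union in findInSpecificNamedGraph works by recording what the base iterator yields and dropping repeats from the delayed-write iterator. With plain JDK iterators the same pattern looks roughly like this (an eager sketch; Jena's CompositionBase.recording streams lazily):

```java
import java.util.*;

public class UnionIter {
    // Concatenate two iterators: record everything the first yields, then
    // drop any element of the second that was already seen, preserving
    // union (no-duplicate) semantics.
    public static <T> Iterator<T> union(Iterator<T> first, Iterator<T> second) {
        Set<T> seen = new HashSet<>();
        List<T> out = new ArrayList<>(); // eager for clarity; a lazy version would wrap the iterators
        first.forEachRemaining(t -> { seen.add(t); out.add(t); });
        second.forEachRemaining(t -> { if (!seen.contains(t)) out.add(t); });
        return out.iterator();
    }
}
```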




Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-18 Thread Dick Murray
Right, I think I cracked it! :-)

Two classes defined below, one extends DatasetGraphInMemory, one provides a
small test (basically a quad echo).

Simple overview;

addToNamedGraph writes the quad into a separate QuadTable if the
transaction is READ otherwise it calls super...

findInNamedGraph returns the super find if the transaction is READ
otherwise it returns the union of the super find and the separate
QuadTable.find.

end checks if the transaction is READ and separate quad tables exist and if
they do it begins a WRITE transaction and copies the quads then updates a
global set of cached quads.

I'm sure this upholds read committed for threads holding READ and also for
the thread holding the READ which needs to WRITE because of the union.
Subsequent threads which READ will see the changes of the WRITE after it
has been committed. The delayed WRITE in the end() will proceed as normal
WRITE blocking the READ thread from continuing.

Comments please?

The following class will create the quad it's asked to find the first time
it is asked to find it.

package org.iungo.dataset;

import org.apache.jena.sparql.core.Quad;

public class DatasetGraphEcho extends DatasetGraphOnDemand {

public DatasetGraphEcho() {
onDemand.add(new CacheOnDemand() {
@Override
public void cache(Quad t) {
add(t);
}
});
}
}


By extending this class...

package org.iungo.dataset;

import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

import org.apache.jena.graph.Node;
import org.apache.jena.graph.compose.CompositionBase;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.sparql.core.Quad;
import org.apache.jena.sparql.core.mem.DatasetGraphInMemory;
import org.apache.jena.sparql.core.mem.HexTable;
import org.apache.jena.sparql.core.mem.QuadTable;
import org.apache.jena.util.iterator.WrappedIterator;

public abstract class DatasetGraphOnDemand extends DatasetGraphInMemory {

protected static class DelayedWrite {
protected final QuadTable quadTable = new HexTable();
}
protected final ThreadLocal<Map<Quad, DelayedWrite>> delayedWrites =
ThreadLocal.withInitial(() -> new HashMap<>());
protected static interface OnDemand extends Consumer<Quad> {
}

protected Set<Quad> cached = ConcurrentHashMap.newKeySet();
protected abstract class CacheOnDemand implements OnDemand {
@Override
public void accept(Quad q) {
if (!cached.contains(q)) {
cache(q);
}
}
abstract void cache(Quad q);
}
protected final List<OnDemand> onDemand = new LinkedList<>();
@Override
protected void addToDftGraph(Node s, Node p, Node o) {
throw new UnsupportedOperationException();
}

@Override
protected void addToNamedGraph(Node g, Node s, Node p, Node o) {
if (transactionType().equals(ReadWrite.READ)) {
Map<Quad, DelayedWrite> m = delayedWrites.get();
Quad q = new Quad(g, s, p, o);
DelayedWrite delayedWrite = m.get(q);
if (delayedWrite == null) {
delayedWrite = new DelayedWrite();
m.put(q, delayedWrite);
}
delayedWrite.quadTable.add(q);
} else {
super.addToNamedGraph(g, s, p, o);
}
}

@Override
protected Iterator<Quad> findInSpecificNamedGraph(Node g, Node s, Node p,
Node o) {
final Quad q = new Quad(g, s, p, o);
onDemand.forEach(u -> u.accept(q));
final Iterator<Quad> i = super.findInSpecificNamedGraph(g, s, p, o);
if (transactionType().equals(ReadWrite.READ)) {
DelayedWrite delayedWrite = delayedWrites.get().get(q);
if (delayedWrite == null) {
return i;
} else {
Set<Quad> seen = new HashSet<>();
/*
* Return the read quads and then the delayedWrite quads dropping any which
have already been seen so as to preserve the union contract.
*/
return CompositionBase.recording(WrappedIterator.create(i), seen).andThen(
WrappedIterator.create(delayedWrite.quadTable.find(g, s, p,
o).iterator()).filterDrop( seen::contains ));
}
} else {
return i;
}
}

@Override
public void end() {
final Boolean applyDelayedWrites = transactionType().equals(ReadWrite.READ)
&& delayedWrites.get().size() > 0;
super.end();
if (applyDelayedWrites) {
begin(ReadWrite.WRITE);
try {
Map<Quad, DelayedWrite> m = delayedWrites.get();
m.forEach(new BiConsumer<Quad, DelayedWrite>() {
@Override
public void accept(Quad t, DelayedWrite u) {
u.quadTable.begin(ReadWrite.READ);
u.quadTable.find(Node.ANY, Node.ANY, Node.ANY, Node.ANY).forEach(quad ->
add(quad));
u.quadTable.end();
}
});
commit(); // All the delay writes have now been committed to the dataset
graph.
// Add all the cached quads to the global cache
m.keySet().forEach(q -> cached.add(q));
} finally {
super.end();
}
}
}
}


On 15 March 2016 at 21:50, Andy Seaborne  wrote:

> On 15/03/16 14:12, A. Soroka wrote:
>
>> I guess you could use addGraph to intercept and alter (or substitute) the
>> graph,
>>
>
> Or
> getGraph(graphNode)
>
>
> but that seems like a real distortion of the semantics. Seems like the
>> AFS-Dev material is more to 

Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-15 Thread Andy Seaborne

On 15/03/16 14:12, A. Soroka wrote:

I guess you could use addGraph to intercept and alter (or substitute) the graph,


Or
getGraph(graphNode)



but that seems like a real distortion of the semantics. Seems like the
AFS-Dev material is more to the point here.





Andy, what do you think it would take to get that stuff to Jena
master? Do you think it is ready for that? I would be happy to
refactor TIM to use it instead of the stuff it currently uses in
o.a.j.sparql.core.mem.


I don't think it's ready - it has not been used "in anger" and it may be
the wrong design.  It needs trying out outside the codebase.  TIM works
as it currently is, so there isn't a rush to put it in there.



(digression:...)

I was at a talk recently about high performance Java and the issue of
object churn was mentioned as being quite impactful on the GC as the
heap size grows.  Once, a long time ago, object creation was expensive
... then CPUs got faster and the Java runtime smarter and it was less of
an issue ... but it seems that it's returning as a factor


inline lambdas are apparently faster than the same code with a class
implementation - the compiler emits an invokedynamic for the lambda


and Java Stream can cause a lot of short-lived objects.

Andy



---
A. Soroka
The University of Virginia Library


On Mar 15, 2016, at 7:39 AM, Dick Murray  wrote:

Eureka moment! It returns a new Graph of a certain type, whereas I need the
graph node to determine where the underlying data is.

Cheers Dick.

On 15 March 2016 at 11:28, Andy Seaborne  wrote:


On 15/03/16 10:30, Dick Murray wrote:


Sorry, supportsTransactionAbort() in AFS-Dev
/src
/main
/java
/projects
/dsg2
/
*DatasetGraphStorage.java*



*Experimental code.*



supportsTransactionAbort is in the DatasetGraph interface in Jena.


DatasetGraphStorage is using TransactionalLock.createMRSW

As mentioned, it needs cooperation from the underlying thing to be able to
do aborts and MRSW does not provide that (it's external locking).

DatasetGraphStorage doesn't presume that the storage unit is transactional.

After these discussions I've decided to create a DatasetGraphOnDemand which

extends DatasetGraphMap and uses Union graphs.

However in DatasetGraphMap shouldn't getGraphCreate() be
getGraphCreate(Node graphNode) as otherwise it doesn't know what to
create?



> It creates a graph - addGraph(graphNode, g) is managing the naming. Graphs
don't know the name used (in other places one graph can have many names).

DatasetGraphMap is for a collection of independent graphs to be turned
into a dataset.

Andy



 @Override
 public Graph getGraph(Node graphNode)
 {
 Graph g = graphs.get(graphNode) ;
 if ( g == null )
 {
 g = getGraphCreate() ;
 if ( g != null )
 addGraph(graphNode, g) ;
 }
 return g ;
 }

 /** Called from getGraph when a nonexistent graph is asked for.
  * Return null for "nothing created as a graph"
  */
 protected Graph getGraphCreate() { return null ; }

Dick.










Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-15 Thread A. Soroka
I guess you could use addGraph to intercept and alter (or substitute) the 
graph, but that seems like a real distortion of the semantics. Seems like the 
AFS-Dev material is more to the point here. Andy, what do you think it would 
take to get that stuff to Jena master? Do you think it is ready for that? I 
would be happy to refactor TIM to use it instead of the stuff it currently uses 
in o.a.j.sparql.core.mem.

---
A. Soroka
The University of Virginia Library

> On Mar 15, 2016, at 7:39 AM, Dick Murray  wrote:
> 
> Eureka moment! It returns a new Graph of a certain type, whereas I need the
> graph node to determine where the underlying data is.
> 
> Cheers Dick.
> 
> On 15 March 2016 at 11:28, Andy Seaborne  wrote:
> 
>> On 15/03/16 10:30, Dick Murray wrote:
>> 
>>> Sorry, supportsTransactionAbort() in AFS-Dev
>>> /src
>>> /main
>>> /java
>>> /projects
>>> /dsg2
>>> /
>>> *DatasetGraphStorage.java*
>>> 
>> 
>> *Experimental code.*
>> 
>> 
>> 
>> supportsTransactionAbort is in the DatasetGraph interface in Jena.
>> 
>> 
>> DatasetGraphStorage is using TransactionalLock.createMRSW
>> 
>> As mentioned, it needs cooperation from the underlying thing to be able to
>> do aborts and MRSW does not provide that (it's external locking).
>> 
>> DatasetGraphStorage doesn't presume that the storage unit is transactional.
>> 
>> After these discussions I've decided to create a DatasetGraphOnDemand which
>>> extends DatasetGraphMap and uses Union graphs.
>>> 
>>> However in DatasetGraphMap shouldn't getGraphCreate() be
>>> getGraphCreate(Node graphNode) as otherwise it doesn't know what to
>>> create?
>>> 
>> 
>> It creates a graph - addGraph(graphNode, g) is managing the naming. Graphs
>> don't know the name used (in other places one graph can have many names).
>> 
>> DatasetGraphMap is for a collection of independent graphs to be turned
>> into a dataset.
>> 
>>Andy
>> 
>> 
>>> @Override
>>> public Graph getGraph(Node graphNode)
>>> {
>>> Graph g = graphs.get(graphNode) ;
>>> if ( g == null )
>>> {
>>> g = getGraphCreate() ;
>>> if ( g != null )
>>> addGraph(graphNode, g) ;
>>> }
>>> return g ;
>>> }
>>> 
>>> /** Called from getGraph when a nonexistent graph is asked for.
>>>  * Return null for "nothing created as a graph"
>>>  */
>>> protected Graph getGraphCreate() { return null ; }
>>> 
>>> Dick.
>>> 
>>> 
>> 



Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-15 Thread Dick Murray
Eureka moment! It returns a new Graph of a certain type, whereas I need the
graph node to determine where the underlying data is.

Cheers Dick.

On 15 March 2016 at 11:28, Andy Seaborne  wrote:

> On 15/03/16 10:30, Dick Murray wrote:
>
>> Sorry, supportsTransactionAbort() in AFS-Dev
>> /src
>> /main
>> /java
>> /projects
>> /dsg2
>> /
>> *DatasetGraphStorage.java*
>>
>
> *Experimental code.*
>
>
>
> supportsTransactionAbort is in the DatasetGraph interface in Jena.
>
>
> DatasetGraphStorage is using TransactionalLock.createMRSW
>
> As mentioned, it needs cooperation from the underlying thing to be able to
> do aborts and MRSW does not provide that (it's external locking).
>
> DatasetGraphStorage doesn't presume that the storage unit is transactional.
>
> After these discussions I've decided to create a DatasetGraphOnDemand which
>> extends DatasetGraphMap and uses Union graphs.
>>
>> However in DatasetGraphMap shouldn't getGraphCreate() be
>> getGraphCreate(Node graphNode) as otherwise it doesn't know what to
>> create?
>>
>
> It creates a graph - addGraph(graphNode, g) is managing the naming. Graphs
> don't know the name used (in other places one graph can have many names).
>
> DatasetGraphMap is for a collection of independent graphs to be turned
> into a dataset.
>
> Andy
>
>
>>  @Override
>>  public Graph getGraph(Node graphNode)
>>  {
>>  Graph g = graphs.get(graphNode) ;
>>  if ( g == null )
>>  {
>>  g = getGraphCreate() ;
>>  if ( g != null )
>>  addGraph(graphNode, g) ;
>>  }
>>  return g ;
>>  }
>>
>>  /** Called from getGraph when a nonexistent graph is asked for.
>>   * Return null for "nothing created as a graph"
>>   */
>>  protected Graph getGraphCreate() { return null ; }
>>
>> Dick.
>>
>>
>


Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-15 Thread Andy Seaborne

On 15/03/16 10:30, Dick Murray wrote:

Sorry, supportsTransactionAbort() in AFS-Dev
/src
/main
/java
/projects
/dsg2
/
*DatasetGraphStorage.java*


*Experimental code.*



supportsTransactionAbort is in the DatasetGraph interface in Jena.


DatasetGraphStorage is using TransactionalLock.createMRSW

As mentioned, it needs cooperation from the underlying thing to be able 
to do aborts and MRSW does not provide that (it's external locking).


DatasetGraphStorage doesn't presume that the storage unit is transactional.


After these discussions I've decided to create a DatasetGraphOnDemand which
extends DatasetGraphMap and uses Union graphs.

However in DatasetGraphMap shouldn't getGraphCreate() be
getGraphCreate(Node graphNode) as otherwise it doesn't know what to create?


It creates a graph - addGraph(graphNode, g) is managing the naming.
Graphs don't know the name used (in other places one graph can have many
names).


DatasetGraphMap is for a collection of independent graphs to be turned 
into a dataset.


Andy


 @Override
 public Graph getGraph(Node graphNode)
 {
 Graph g = graphs.get(graphNode) ;
 if ( g == null )
 {
 g = getGraphCreate() ;
 if ( g != null )
 addGraph(graphNode, g) ;
 }
 return g ;
 }

 /** Called from getGraph when a nonexistent graph is asked for.
  * Return null for "nothing created as a graph"
  */
 protected Graph getGraphCreate() { return null ; }

Dick.





Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-15 Thread Dick Murray
Sorry, supportsTransactionAbort() in AFS-Dev
<https://github.com/afs/AFS-Dev>/src
<https://github.com/afs/AFS-Dev/tree/master/src>/main
<https://github.com/afs/AFS-Dev/tree/master/src/main>/java
<https://github.com/afs/AFS-Dev/tree/master/src/main/java>/projects
<https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects>/dsg2
<https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/dsg2>/
*DatasetGraphStorage.java*

After these discussions I've decided to create a DatasetGraphOnDemand which
extends DatasetGraphMap and uses Union graphs.

However in DatasetGraphMap shouldn't getGraphCreate() be
getGraphCreate(Node graphNode) as otherwise it doesn't know what to create?

@Override
public Graph getGraph(Node graphNode)
{
Graph g = graphs.get(graphNode) ;
if ( g == null )
{
g = getGraphCreate() ;
if ( g != null )
addGraph(graphNode, g) ;
}
return g ;
}

/** Called from getGraph when a nonexistent graph is asked for.
 * Return null for "nothing created as a graph"
 */
protected Graph getGraphCreate() { return null ; }
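The question above is essentially that the factory never sees the key. The JDK's Map.computeIfAbsent shows the key-aware shape being asked for (a plain-Java analogy using hypothetical names, not the Jena API):

```java
import java.util.*;
import java.util.function.Function;

// Hypothetical analogy to DatasetGraphMap: graphs keyed by name, created
// lazily by a factory that *receives the name* - the point of the proposed
// getGraphCreate(Node graphNode).
public class GraphMapSketch {
    private final Map<String, List<String>> graphs = new HashMap<>();
    private final Function<String, List<String>> create;

    public GraphMapSketch(Function<String, List<String>> create) {
        this.create = create;
    }

    // computeIfAbsent passes the key to the factory, so the factory can
    // decide what kind of graph to build for that name.
    public List<String> getGraph(String name) {
        return graphs.computeIfAbsent(name, create);
    }
}
```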

Dick.

On 14 March 2016 at 09:56, Andy Seaborne <a...@apache.org> wrote:

> On 14/03/16 07:31, Joint wrote:
>
>>
>>
>> 
>> That doesn't read well...
>> I tested two types of triple storage both of which use a concurrent map
>> to track the graphs. The first used the TripleTable and took write locks so
>> there was one write per graph. The second used a concurrent skip list set
>> and no write locks so there is no write contention.
>> Your dev code has a method canAbort set to return false. I was wondering
>> what the idea was?
>>
>
> Where is canAbort?
> Are you looking at the Jena code or Mantis code?
> Do you mean supportsTransactionAbort?
>
> A system can't provide a proper abort unless it can reconstruct the old
> state, either by having two copies (Txn in memory does this) or a log of
> some kind (TDB does this).
>
> For example, plain synchronization MRSW locking can't provide an abort
> operation. It needs the cooperation of components to do that.
>
> Andy
>
>
>
>
>
>> Dick
>>
>> ---- Original message ----
>> From: Andy Seaborne <a...@apache.org>
>> Date: 13/03/2016  7:54 pm  (GMT+00:00)
>> To: users@jena.apache.org
>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
>>DatasetGraphInMemory
>>
>> On 10/03/16 20:10, Dick Murray wrote:
>>
>>> Hi. Yes re TriTable and TripleTable. I too like the storage interface
>>> which
>>> would work for my needs and make life simpler. A few points from me.
>>> Currently I wrap an existing dsg and cache the additional tuples into
>>> what
>>> I call the deferred DSG or DDSG. The finds return a DSG iterator and a
>>> DDSG
>>> iterator.
>>>
>>> The DDSG is in memory and I have a number of concrete classes which
>>> achieve
>>> the same end.
>>>
>>> Firstly I use a Jena core mem DSG and the find handles just add tuples
>>> as
>>> required into the HexTable because I don't have a default graph, i.e.
>>> it's
>>> never referenced because I need a graph URI to find the deferred data.
>>>
>>> The second, in common, is I have a concurrent map which handles recording
>>> what graphs have been deferred then I either use TriTable or a concurrent
>>> set of tuples to store the graph contents. When I'm using the TriTable I
>>> acquire the write lock and add tuples. So writes can occur in parallel to
>>> different graphs. I've experimented with the concurrent set by spoofing
>>> the
>>> write and just adding the tuples, i.e. no write lock contention per
>>> graph. I
>>> notice the DatasetGraphStorage
>>>
>>
>> 
>>
>> does not support txn abort? This gives an
>>> in memory DSG which doesn't have lock contention because it never
>>> locks...
>>> This is applicable in some circumstances and I think that the right
>>> deferred tuples is one of them?
>>>
>>> I also coded a DSG which supports a reentrant RW with upgrade lock
>>> which
>>> allowed me to combine the two DSG's because I could promote the read
>>> lock.
>>>
>>> Andy I notice your code has a txn interface with a read to write
>>> promotion
>>> indicator? Is an upgrade method being considered to the txn interface
>>> because that was an issue I hit and why I have two dsg's. Code further up

Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-14 Thread Andy Seaborne

On 14/03/16 07:31, Joint wrote:




That doesn't read well...
I tested two types of triple storage both of which use a concurrent map to 
track the graphs. The first used the TripleTable and took write locks so there 
was one write per graph. The second used a concurrent skip list set and no 
write locks so there is no write contention.
Your dev code has a method canAbort set to return false. I was wondering what
the idea was?


Where is canAbort?
Are you looking at the Jena code or Mantis code?
Do you mean supportsTransactionAbort?

A system can't provide a proper abort unless it can reconstruct the old 
state, either by having two copies (Txn in memory does this) or a log of 
some kind (TDB does this).


For example, plain synchronization MRSW locking can't provide an abort 
operation. It needs the cooperation of components to do that.
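The point that abort needs the old state can be shown with a minimal copy-based transactional set: begin() snapshots, commit() replaces, abort() just discards the working copy (toy code with hypothetical names, one of the "two copies" strategies mentioned above):

```java
import java.util.*;

// Minimal sketch: keeping a copy of the committed state at begin() is what
// makes abort() possible - abort simply throws the working copy away.
public class TxnSet {
    private final Set<String> committed = new HashSet<>();
    private Set<String> working = null;

    public void begin()       { working = new HashSet<>(committed); } // the second copy
    public void add(String s) { working.add(s); }
    public void commit()      { committed.clear(); committed.addAll(working); working = null; }
    public void abort()       { working = null; } // old state untouched
    public Set<String> view() { return Collections.unmodifiableSet(committed); }
}
```

A plain MRSW lock has no such copy (or log), which is why locking alone cannot provide a proper abort.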


Andy





Dick

 Original message 
From: Andy Seaborne <a...@apache.org>
Date: 13/03/2016  7:54 pm  (GMT+00:00)
To: users@jena.apache.org
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
   DatasetGraphInMemory

On 10/03/16 20:10, Dick Murray wrote:

Hi. Yes re TriTable and TripleTable. I too like the storage interface which
would work for my needs and make life simpler. A few points from me.
Currently I wrap an existing dsg and cache the additional tuples into what
I call the deferred DSG or DDSG. The finds return a DSG iterator and a DDSG
iterator.

The DDSG is in memory and I have a number of concrete classes which achieve
the same end.

Firstly I use a Jena core mem DSG and the find handles just add tuples as
required into the HexTable because I don't have a default graph, i.e. it's
never referenced because I need a graph URI to find the deferred data.

The second, in common, is I have a concurrent map which handles recording
what graphs have been deferred then I either use TriTable or a concurrent
set of tuples to store the graph contents. When I'm using the TriTable I
acquire the write lock and add tuples. So writes can occur in parallel to
different graphs. I've experimented with the concurrent set by spoofing the
write and just adding the tuples, i.e. no write lock contention per graph. I
notice the DatasetGraphStorage





does not support txn abort? This gives an
in memory DSG which doesn't have lock contention because it never locks...
This is applicable in some circumstances and I think that the right
deferred tuples is one of them?

I also coded a DSG which supports a reentrant RW with upgrade lock which
allowed me to combine the two DSG's because I could promote the read lock.

Andy I notice your code has a txn interface with a read to write promotion
indicator? Is an upgrade method being considered to the txn interface
because that was an issue I hit and why I have two dsg's. Code further up
the stack calls a txn read but a cache miss needs a write to persist the
new tuples.

A dynamic adapter would support a defined set of handles and the find would
be shimmed to check if any tuples need to be added. If we could define a
set of interfaces to achieve this which shouldn't be too difficult.

On the subject of storage is there any thought to providing granular
locking, DSG, per graph, dirty..?

Dick


Per graph indexing only makes sense if the graphs are held separately.
A quad table isn't going to work very well because some quads are in one
graph and some in another yet all in the same index structure.

So a ConcurrentHashMap holding (c.f. what is now called
DatasetGraphMapLink) separate graphs would seem to make sense.
Contributions welcome.
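A minimal sketch of that shape (invented names, not DatasetGraphMapLink itself): a ConcurrentHashMap keyed by graph name, with each graph's triples in their own concurrent set created on first use, so writers to different graphs never contend on a shared index structure.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch (invented names, not DatasetGraphMapLink): each named graph lives in
// its own concurrent set under a ConcurrentHashMap, created on demand, so
// writers to different graphs share no index structure.
public class GraphMapStore {
    private final Map<String, Set<String>> graphs = new ConcurrentHashMap<>();

    public void add(String graphName, String triple) {
        // computeIfAbsent is atomic: two threads racing on a new graph name
        // still end up sharing a single set for that graph.
        graphs.computeIfAbsent(graphName, g -> ConcurrentHashMap.newKeySet())
              .add(triple);
    }

    public Set<String> find(String graphName) {
        return graphs.getOrDefault(graphName, Set.of());
    }
}
```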

Transaction promotion is an interestingly tricky thing - it can mean a
system has to cause aborts or lower the isolation guarantees. (E.g. Txn1
starts Read, Txn2 starts write-updates-commits, Txn1 continues, can't
see Txn2 changes (note it may be before or after Txn2 ran), then Txn1
attempts to promote to a W transaction.)  Read-committed leads to
non-repeatable reads (things like count() go wrong, for example).
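One conservative promotion rule can be sketched with a global version counter (a hypothetical scheme, not Jena's transaction machinery): record the version seen when the read began, and allow promotion only if no writer has committed since; otherwise the promotion must fail, be retried, or accept read-committed semantics.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical promotion guard (not Jena's transaction code): a read
// transaction remembers the global data version at begin; promotion to
// write is allowed only if no writer has committed in between, so a
// promoted transaction never silently sees non-repeatable state.
public class PromotionGuard {
    private final AtomicLong version = new AtomicLong();

    public long beginRead() { return version.get(); }

    public boolean tryPromote(long versionAtBegin) {
        return version.get() == versionAtBegin; // safe only if nothing changed
    }

    public void commitWrite() { version.incrementAndGet(); }
}
```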

When you say "your code has a txn interface" I take it you mean non-Jena code?

That all said, this sounds like a simpler case - just because a read
transaction needs to update internal caches does not mean it's the fully
general case of transaction promotion.  A lock and weaker isolation may do.

Andy








Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-14 Thread Joint



That doesn't read well...
I tested two types of triple storage, both of which use a concurrent map to 
track the graphs. The first used the TripleTable and took write locks so there 
was one write per graph. The second used a concurrent skip list set and no 
write locks so there is no write contention.
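The second, lock-free variant might look roughly like this (a sketch with string-encoded triples standing in for real Triple objects): a ConcurrentSkipListSet accepts concurrent add() calls without any explicit write lock, so per-graph writes don't contend.

```java
import java.util.concurrent.ConcurrentSkipListSet;

// Sketch of the lock-free variant (string-encoded "s p o" entries stand in
// for real triples): ConcurrentSkipListSet tolerates concurrent add() calls
// with no explicit write lock, so there is no write contention on the graph.
public class LockFreeGraph {
    private final ConcurrentSkipListSet<String> triples = new ConcurrentSkipListSet<>();

    public boolean add(String s, String p, String o) {
        return triples.add(s + " " + p + " " + o); // lock-free, set semantics
    }

    public int size() { return triples.size(); }
}
```

The trade-off against the TripleTable route is that set membership is the only guarantee; there is no transactional visibility boundary, which is why it suits the "deferred, read-mostly" case discussed here.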
Your dev code has a method canAbort set to return false. I was wondering what
the idea was?

Dick

 Original message 
From: Andy Seaborne <a...@apache.org> 
Date: 13/03/2016  7:54 pm  (GMT+00:00) 
To: users@jena.apache.org 
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
  DatasetGraphInMemory 

On 10/03/16 20:10, Dick Murray wrote:
> Hi. Yes re TriTable and TripleTable. I too like the storage interface which
> would work for my needs and make life simpler. A few points from me.
> Currently I wrap an existing dsg and cache the additional tuples into what
> I call the deferred DSG or DDSG. The finds return a DSG iterator and a DDSG
> iterator.
>
> The DDSG is in memory and I have a number of concrete classes which achieve
> the same end.
>
> Firstly I use a Jena core mem DSG and the find handles just add tuples as
> required into the HexTable, because I don't have a default graph, i.e. it's
> never referenced because I need a graph URI to find the deferred data.
>
> The second is in common I have a concurrent map which handles recording
> what graphs have been deferred then I either use TriTable or a concurrent
> set of tuples to store the graph contents. When I'm using the TriTable I
> acquire the write lock and add tuples. So writes can occur in parallel to
> different graphs. I've experimented with the concurrent set by spoofing the
> write and just adding the tuples I.e. no write lock contention per graph. I
> notice the DatasetGraphStorage



> does not support txn abort? This gives an
> in memory DSG which doesn't have lock contention because it never locks...
> This is applicable in some circumstances and I think that the right
> deferred tuples is one of them?
>
> I also coded a DSG which supports a reentrant RW with upgrade lock which
> allowed me to combine the two DSG's because I could promote the read lock.
>
> Andy I notice your code has a txn interface with a read to write promotion
> indicator? Is an upgrade method being considered to the txn interface
> because that was an issue I hit and why I have two dsg's. Code further up
> the stack calls a txn read but a cache miss needs a write to persist the
> new tuples.
>
> A dynamic adapter would support a defined set of handles and the find would
> be shimmed to check if any tuples need to be added. If we could define a
> set of interfaces to achieve this which shouldn't be too difficult.
>
> On the subject of storage is there any thought to providing granular
> locking, DSG, per graph, dirty..?
>
> Dick

Per graph indexing only makes sense if the graphs are held separately. 
A quad table isn't going to work very well because some quads are in one 
graph and some in another yet all in the same index structure.

So a ConcurrentHashMap holding (c.f. what is now called 
DatasetGraphMapLink) separate graphs would seem to make sense. 
Contributions welcome.

Transaction promotion is an interestingly tricky thing - it can mean a 
system has to cause aborts or lower the isolation guarantees. (E.g. Txn1 
starts Read, Txn2 starts write-updates-commits, Txn1 continues, can't 
see Txn2 changes (note it may be before or after Txn2 ran), then Txn1 
attempts to promote to a W transaction.)  Read-committed leads to 
non-repeatable reads (things like count() go wrong, for example).

When you say "your code has a txn interface" I take it you mean non-Jena code?

That all said, this sounds like a simpler case - just because a read 
transaction needs to update internal caches does not mean it's the fully 
general case of transaction promotion.  A lock and weaker isolation may do.

Andy






Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-13 Thread Andy Seaborne

On 10/03/16 20:10, Dick Murray wrote:

Hi. Yes re TriTable and TripleTable. I too like the storage interface which
would work for my needs and make life simpler. A few points from me.
Currently I wrap an existing dsg and cache the additional tuples into what
I call the deferred DSG or DDSG. The finds return a DSG iterator and a DDSG
iterator.

The DDSG is in memory and I have a number of concrete classes which achieve
the same end.

Firstly I use a Jena core mem DSG and the find handles just add tuples as
required into the HexTable, because I don't have a default graph, i.e. it's
never referenced because I need a graph URI to find the deferred data.

The second is in common I have a concurrent map which handles recording
what graphs have been deferred then I either use TriTable or a concurrent
set of tuples to store the graph contents. When I'm using the TriTable I
acquire the write lock and add tuples. So writes can occur in parallel to
different graphs. I've experimented with the concurrent set by spoofing the
write and just adding the tuples I.e. no write lock contention per graph. I
notice the DatasetGraphStorage





does not support txn abort? This gives an
in memory DSG which doesn't have lock contention because it never locks...
This is applicable in some circumstances and I think that the right
deferred tuples is one of them?

I also coded a DSG which supports a reentrant RW with upgrade lock which
allowed me to combine the two DSG's because I could promote the read lock.

Andy I notice your code has a txn interface with a read to write promotion
indicator? Is an upgrade method being considered to the txn interface
because that was an issue I hit and why I have two dsg's. Code further up
the stack calls a txn read but a cache miss needs a write to persist the
new tuples.

A dynamic adapter would support a defined set of handles and the find would
be shimmed to check if any tuples need to be added. If we could define a
set of interfaces to achieve this which shouldn't be too difficult.

On the subject of storage is there any thought to providing granular
locking, DSG, per graph, dirty..?

Dick


Per graph indexing only makes sense if the graphs are held separately. 
A quad table isn't going to work very well because some quads are in one 
graph and some in another yet all in the same index structure.


So a ConcurrentHashMap holding (c.f. what is now called 
DatasetGraphMapLink) separate graphs would seem to make sense. 
Contributions welcome.


Transaction promotion is an interestingly tricky thing - it can mean a 
system has to cause aborts or lower the isolation guarantees. (E.g. Txn1 
starts Read, Txn2 starts write-updates-commits, Txn1 continues, can't 
see Txn2 changes (note it may be before or after Txn2 ran), then Txn1 
attempts to promote to a W transaction.)  Read-committed leads to 
non-repeatable reads (things like count() go wrong, for example).


When you say "your code has a txn interface" I take it you mean non-Jena code?

That all said, this sounds like a simpler case - just because a read 
transaction needs to update internal caches does not mean it's the fully 
general case of transaction promotion.  A lock and weaker isolation may do.


Andy






Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-10 Thread A. Soroka
On this particular point, there has been such discussion recently:

http://markmail.org/message/wo5r3edi7xzt7zmx
http://markmail.org/message/hxao4izpiv7quumv

but no action that I know of. (Claude Warren would know more than me.)

---
A. Soroka
The University of Virginia Library

> On Mar 10, 2016, at 3:10 PM, Dick Murray  wrote:
> 
> On the subject of storage is there any thought to providing granular locking, 
> DSG, per graph, dirty..?



Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-10 Thread Dick Murray
   - Query the ifc2x3 schema to list the explicit Entity
>>attributes and for each we add a triple to TriTable e.g.
>>ifcslab:ifcorganization =
>>
{
>>

>> }
>>- In addition we add the triple
>>
{ a
>>}.
>>- If we are creating linked triples (i.e. max depth > 1)
>>then for each attribute which has a SDAI entity
>> instance value call the
>>appropriate handle to create the triples.
>> - G commits the TriTable write transaction (make the triples
>>  visible before we update the find triples!).
>>  - G updates the find triples to include;
>>  - {ANY, a }
>> - {
>> ANY ANY}
>> - Repeat the above for any linked triples created.
>> - The TriTable now contains the triples required to answer the find triple.
>>  - G will return TriTable.find(ANY, a
>>)
>>- Jena ends the DG read transaction.
>>
>>
>> Some find triples will result in the appropriate handle being called
>> (handle hit) which will create triples. Others will handle miss and be
>> passed on to the TriTable find (e.g. no triples created and TriTable will
>> return nothing). A few will result in a UOE {ANY, ANY, ANY} being an
>> example because does this mean create all of the triples (+100M) or all of
>> the currently created triples (which relies on having queried what you need
>> to ANY!). Currently we only UOE on {ANY ANY ANY} and is it really useful to
>> ask this find?
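The handle-hit / handle-miss / UOE flow quoted above can be sketched roughly as follows (invented names and a string pattern key, not the actual wrapper): the first find for a pattern invokes a handle to materialise triples into the backing store, later finds for the same pattern answer straight from the store, and a fully unbounded {ANY, ANY, ANY} find is refused.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Sketch of the find-caching flow (illustrative, not the real wrapper):
// a handle materialises triples for a pattern on its first use, the pattern
// is recorded, and subsequent finds are answered from the store.
public class CachingFind {
    static final String ANY = "*";

    private final Set<String> handledPatterns = new HashSet<>();
    private final List<String[]> store = new ArrayList<>();
    private final Function<String[], List<String[]>> handle;

    public CachingFind(Function<String[], List<String[]>> handle) {
        this.handle = handle;
    }

    public List<String[]> find(String s, String p, String o) {
        if (ANY.equals(s) && ANY.equals(p) && ANY.equals(o))
            throw new UnsupportedOperationException("find(ANY, ANY, ANY)");
        String key = s + "|" + p + "|" + o;
        if (handledPatterns.add(key))                  // first time: call handle
            store.addAll(handle.apply(new String[]{s, p, o}));
        List<String[]> out = new ArrayList<>();        // then answer from the store
        for (String[] t : store)
            if (matches(t[0], s) && matches(t[1], p) && matches(t[2], o))
                out.add(t);
        return out;
    }

    private static boolean matches(String value, String pattern) {
        return ANY.equals(pattern) || value.equals(pattern);
    }
}
```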
>>
>> Hope that clears up the "writes are not supported" (the underlying data is
>> read only) and why the TupleTable subtypes are not problematic. I could
>> have held the created triples per find triple but that wouldn't scale with
>> duplication plus why recreate the wheel when if I'm not mistaken TriTable
>> uses the dexx collection giving subsequent HAMT advantages which is what a
>> high performance in memory implementation requires. The solution is working
>> and compared to a fully transformed TDB is giving the correct results. To
>> do might include timing out the G when they have not been accessed for a
>> period of time...
>>
>> Finally, having written the wrapper I thought it wouldn't be used anywhere
>> else, but subsequently it was used to abstract an existing system where
>> ad hoc semantic access was required, and it's lined up to do a similar task on
>> two other data silos. Hence the question to Andy regarding a Jena cached
>> SPI package.
>>
>> Thanks again for your help Adam/Andy.
>>
>> Dick.
>>
>>
>>
>> On 4 March 2016 at 01:36, A. Soroka <aj...@virginia.edu> wrote:
>>
>>> I’m confused about two of your points here. Let me separate them out so we
>>> can discuss them easily.
>>>
>>> 1) "writes are not supported”:
>>>
>>> Writes are certainly supported in the Graph/DatasetGraph SPI. Graph::add
>>> and ::delete, DatasetGraph::add, ::delete, ::deleteAny… after all, Graph
>>> and DatasetGraph are the basic abstractions implemented by Jena’s own
>>> out-of-the-box implementations of RDF storage. Can you explain what you
>>> mean by this?
>>>
>>> 2) "methods which call find(ANY, ANY, ANY) play havoc with an on demand
>>> triple caching algorithm”:
>>>
>>> The subtypes of TupleTable with which you are working have exactly the
>>> same kinds of find() methods. Why are they not problematic in that context?
>>>
>>> ---
>>> A. Soroka
>>> The University of Virginia Library
>>>
>>>> On Mar 3, 2016, at 5:47 AM, Joint <dandh...@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> Hi Andy.
>>>> I implemented the entire SPI at the DatasetGraph and Graph level. It got
>>> to the point where I had overridden more methods than not. In addition
>>> writes are not supported, and it contains methods that call find(ANY, ANY,
>>> ANY), which play havoc with an on-demand triple caching algorithm! ;-) I'm
>>> using the TriTable because it fits and quads are spoofed via triple to quad
>>> iterator.
>>>> I have a set of filters and handles which the find triple is compared
>>> against and either passed straight to the TriTable if the triple has been
>>> handled before or it's passed to the appropriate handle which adds the
>>> triples to the TriTable then calls the find. As the underlying data is a
>>> tree a cache depth can be set which allows related triples to be cached.

Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-10 Thread A. Soroka
c. I could
>> have held the created triples per find triple but that wouldn't scale with
>> duplication plus why recreate the wheel when if I'm not mistaken TriTable
>> uses the dexx collection giving subsequent HAMT advantages which is what a
>> high performance in memory implementation requires. The solution is working
>> and compared to a fully transformed TDB is giving the correct results. To
>> do might include timing out the G when they have not been accessed for a
>> period of time...
>> 
>> Finally, having written the wrapper I thought it wouldn't be used anywhere
>> else, but subsequently it was used to abstract an existing system where
>> ad hoc semantic access was required, and it's lined up to do a similar task on
>> two other data silos. Hence the question to Andy regarding a Jena cached
>> SPI package.
>> 
>> Thanks again for your help Adam/Andy.
>> 
>> Dick.
>> 
>> 
>> 
>> On 4 March 2016 at 01:36, A. Soroka <aj...@virginia.edu> wrote:
>> 
>>> I’m confused about two of your points here. Let me separate them out so we
>>> can discuss them easily.
>>> 
>>> 1) "writes are not supported”:
>>> 
>>> Writes are certainly supported in the Graph/DatasetGraph SPI. Graph::add
>>> and ::delete, DatasetGraph::add, ::delete, ::deleteAny… after all, Graph
>>> and DatasetGraph are the basic abstractions implemented by Jena’s own
>>> out-of-the-box implementations of RDF storage. Can you explain what you
>>> mean by this?
>>> 
>>> 2) "methods which call find(ANY, ANY, ANY) play havoc with an on demand
>>> triple caching algorithm”:
>>> 
>>> The subtypes of TupleTable with which you are working have exactly the
>>> same kinds of find() methods. Why are they not problematic in that context?
>>> 
>>> ---
>>> A. Soroka
>>> The University of Virginia Library
>>> 
>>>> On Mar 3, 2016, at 5:47 AM, Joint <dandh...@gmail.com> wrote:
>>>> 
>>>> 
>>>> 
>>>> Hi Andy.
>>>> I implemented the entire SPI at the DatasetGraph and Graph level. It got
>>> to the point where I had overridden more methods than not. In addition
>>> writes are not supported, and it contains methods that call find(ANY, ANY,
>>> ANY), which play havoc with an on-demand triple caching algorithm! ;-) I'm using
>>> the TriTable because it fits and quads are spoofed via triple to quad
>>> iterator.
>>>> I have a set of filters and handles which the find triple is compared
>>> against and either passed straight to the TriTable if the triple has been
>>> handled before or it's passed to the appropriate handle which adds the
>>> triples to the TriTable then calls the find. As the underlying data is a
>>> tree a cache depth can be set which allows related triples to be cached.
>>> Also the cache can be preloaded with common triples e.g. ANY RDF:type ?.
>>>> Would you consider a generic version for the Jena code base?
>>>> 
>>>> 
>>>> Dick
>>>> 
>>>>  Original message 
>>>> From: Andy Seaborne <a...@apache.org>
>>>> Date: 18/02/2016  6:31 pm  (GMT+00:00)
>>>> To: users@jena.apache.org
>>>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
>>>>  DatasetGraphInMemory
>>>> 
>>>> Hi,
>>>> 
>>>> I'm not seeing how tapping into the implementation of
>>>> DatasetGraphInMemory is going to help (through the details
>>>> 
>>>> As well as the DatasetGraphMap approach, one other thought that occurred
>>>> to me is to have a dataset (DatasetGraphWrapper) over any DatasetGraph
>>>> implementation.
>>>> 
>>>> It loads, and clears, the mapped graph on-demand, and passes the find()
>>>> call through to the now-setup data.
>>>> 
>>>>   Andy
>>>> 
>>>> On 16/02/16 17:42, A. Soroka wrote:
>>>>>> Based on your description the DatasetGraphInMemory would seem to match
>>> the dynamic load requirement. How did you foresee it being loaded? Is there
>>> a large overhead to using the add methods?
>>>>> 
>>>>> No, I certainly did not mean to give that impression, and I don’t think
>>> it is entirely accurate. DSGInMemory was definitely not at all meant for
>>> dynamic loading. That doesn’t mean it can’t be used that way, but that was
>>> not in the design, which assumed that all tuples take about the same amount
>>> of time to access and that all of the same type are coming from the same
>>> implementation (in a QuadTable and a TripleTable).

Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-10 Thread Andy Seaborne
 the cache can be preloaded with common triples e.g. ANY RDF:type ?.

Would you consider a generic version for the Jena code base?


Dick

 Original message 
From: Andy Seaborne <a...@apache.org>
Date: 18/02/2016  6:31 pm  (GMT+00:00)
To: users@jena.apache.org
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
  DatasetGraphInMemory

Hi,

I'm not seeing how tapping into the implementation of
DatasetGraphInMemory is going to help (through the details

As well as the DatasetGraphMap approach, one other thought that occurred
to me is to have a dataset (DatasetGraphWrapper) over any DatasetGraph
implementation.

It loads, and clears, the mapped graph on-demand, and passes the find()
call through to the now-setup data.
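That wrapper idea can be sketched like this (illustrative and far simpler than Jena's DatasetGraphWrapper): the backing data is loaded lazily on the first find(), cached, and clear() drops it so a later find() reloads.

```java
import java.util.List;
import java.util.function.Supplier;

// Sketch of the on-demand wrapper idea (illustrative, not the real
// DatasetGraphWrapper API): load on first find(), cache, clear() to reset.
public class OnDemandGraph {
    private final Supplier<List<String>> loader;
    private List<String> data = null;

    public OnDemandGraph(Supplier<List<String>> loader) { this.loader = loader; }

    public List<String> find() {
        if (data == null)
            data = loader.get();   // load the mapped graph on demand
        return data;               // pass find() through to the now-setup data
    }

    public void clear() { data = null; }
}
```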

   Andy

On 16/02/16 17:42, A. Soroka wrote:

Based on your description the DatasetGraphInMemory would seem to match

the dynamic load requirement. How did you foresee it being loaded? Is there
a large overhead to using the add methods?


No, I certainly did not mean to give that impression, and I don’t think

it is entirely accurate. DSGInMemory was definitely not at all meant for
dynamic loading. That doesn’t mean it can’t be used that way, but that was
not in the design, which assumed that all tuples take about the same amount
of time to access and that all of the same type are coming from the same
implementation (in a QuadTable and a TripleTable).


The overhead of mutating a dataset is mostly inside the implementations

of TupleTable that are actually used to store tuples. You should be aware
that TupleTable extends TransactionalComponent, so if you want to use it to
create some kind of connection to your storage, you will need to make that
connection fully transactional. That doesn’t sound at all trivial in your
case.


At this point it seems to me that extending DatasetGraphMap (and

implementing GraphMaker and Graph instead of TupleTable) might be a more
appropriate design for your work. You can put dynamic loading behavior in
Graph (or a GraphView subtype) just as easily as in TupleTable subtypes.
Are there reasons around the use of transactionality in your work that
demand the particular semantics supported by DSGInMemory?


---
A. Soroka
The University of Virginia Library


On Feb 13, 2016, at 5:18 AM, Joint <dandh...@gmail.com> wrote:



Hi.
The quick full scenario is a distributed DaaS which supports queries,

updates, transforms and bulkloads. Andy Seaborne knows some of the detail
because I spoke to him previously. We achieve multiple writes by having
parallel Datasets, both traditional TDB and on demand in memory. Writes are
sent to a free dataset, free being not in a write transaction. That's a
simplistic overview...

Queries are handled by a dataset proxy which builds a dynamic dataset

based on the graph URIs. For example the graph URI urn:Iungo:all causes the
proxy find method to issue the query to all known Datasets and return the
union of results. Various dataset proxies exist, some load TDBs, others
load TTL files into graphs, others dynamically create tuples. The common
thing being they are all presented as Datasets backed by DatasetGraph. Thus
a SPARQL query can result in multiple Datasets being loaded to satisfy the
query.

Nodes can be preloaded which then load Datasets to satisfy finds. This

way the system can be scaled to handle increased work loads. Also specific
nodes can be targeted to specific hardware.

When a graph URI is encountered the proxy can interpret its

structure. So urn:Iungo:sdai/foo/bar would cause the SDAI model bar in the
SDAI repository foo to be dynamically loaded into memory along with the
quads which are required to satisfy the find.

Typically a group of people will be working on a set of data so the

first to query will load the dataset then it will be accessed multiple
times. There will be an initial dynamic load of data which will tail off
with some additional loading over time.

Based on your description the DatasetGraphInMemory would seem to match

the dynamic load requirement. How did you foresee it being loaded? Is there
a large overhead to using the add methods?

A typical scenario would be to search all SDAI repository's for some

key information then load detailed information in some, continuing to drill
down.

Hope this helps.
I'm going to extend the hex and tri tables and run some tests. I've

already shimmed the DGTriplesQuads so the actual caching code already exists
and should be easy to hook on.

Dick

 Original message 
From: "A. Soroka" <aj...@virginia.edu>
Date: 12/02/2016  11:07 pm  (GMT+00:00)
To: users@jena.apache.org
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using

DatasetGraphInMemory


Okay, I’m more confident at this point that you’re not well served by

DatasetGraphInMemory, which has very strong assumptions about the speedy
reachability of data. DSGInMemory was built for situations when all of the
data is in c

Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-04 Thread Andy Seaborne

On 03/03/16 10:47, Joint wrote:



Hi Andy.
I implemented the entire SPI at the DatasetGraph and Graph level. It got to the 
point where I had overridden more methods than not. In addition writes are not 
supported, and it contains methods that call find(ANY, ANY, ANY), which play 
havoc with an on-demand triple caching algorithm! ;-) I'm using the TriTable because it 
fits and quads are spoofed via triple to quad iterator.
I have a set of filters and handles which the find triple is compared against 
and either passed straight to the TriTable if the triple has been handled 
before or its passed to the appropriate handle which adds the triples to the 
TriTable then calls the find. As the underlying data is a tree a cache depth 
can be set which allows related triples to be cached. Also the cache can be 
preloaded with common triples e.g. ANY RDF:type ?.
Would you consider a generic version for the Jena code base?


Sure - if it is a general capability, then it would be good to have a 
framework for writing read-only adapters to external data.




I'm still unclear as to why you aren't hooking to one subclass of 
DatasetGraph but maybe the code will make that clearer.


Andy




Dick

 Original message 
From: Andy Seaborne <a...@apache.org>
Date: 18/02/2016  6:31 pm  (GMT+00:00)
To: users@jena.apache.org
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
   DatasetGraphInMemory

Hi,

I'm not seeing how tapping into the implementation of
DatasetGraphInMemory is going to help (through the details

As well as the DatasetGraphMap approach, one other thought that occurred
to me is to have a dataset (DatasetGraphWrapper) over any DatasetGraph
implementation.

It loads, and clears, the mapped graph on-demand, and passes the find()
call through to the now-setup data.

Andy

On 16/02/16 17:42, A. Soroka wrote:

Based on your description the DatasetGraphInMemory would seem to match the 
dynamic load requirement. How did you foresee it being loaded? Is there a large 
overhead to using the add methods?


No, I certainly did not mean to give that impression, and I don’t think it is 
entirely accurate. DSGInMemory was definitely not at all meant for dynamic 
loading. That doesn’t mean it can’t be used that way, but that was not in the 
design, which assumed that all tuples take about the same amount of time to 
access and that all of the same type are coming from the same implementation 
(in a QuadTable and a TripleTable).

The overhead of mutating a dataset is mostly inside the implementations of 
TupleTable that are actually used to store tuples. You should be aware that 
TupleTable extends TransactionalComponent, so if you want to use it to create 
some kind of connection to your storage, you will need to make that connection 
fully transactional. That doesn’t sound at all trivial in your case.

At this point it seems to me that extending DatasetGraphMap (and implementing 
GraphMaker and Graph instead of TupleTable) might be a more appropriate design 
for your work. You can put dynamic loading behavior in Graph (or a GraphView 
subtype) just as easily as in TupleTable subtypes. Are there reasons around the 
use of transactionality in your work that demand the particular semantics 
supported by DSGInMemory?

---
A. Soroka
The University of Virginia Library


On Feb 13, 2016, at 5:18 AM, Joint <dandh...@gmail.com> wrote:



Hi.
The quick full scenario is a distributed DaaS which supports queries, updates, 
transforms and bulkloads. Andy Seaborne knows some of the detail because I 
spoke to him previously. We achieve multiple writes by having parallel 
Datasets, both traditional TDB and on demand in memory. Writes are sent to a 
free dataset, free being not in a write transaction. That's a simplistic 
overview...
Queries are handled by a dataset proxy which builds a dynamic dataset based on 
the graph URIs. For example the graph URI urn:Iungo:all causes the proxy find 
method to issue the query to all known Datasets and return the union of 
results. Various dataset proxies exist, some load TDBs, others load TTL files 
into graphs, others dynamically create tuples. The common thing being they are 
all presented as Datasets backed by DatasetGraph. Thus a SPARQL query can 
result in multiple Datasets being loaded to satisfy the query.
Nodes can be preloaded which then load Datasets to satisfy finds. This way the 
system can be scaled to handle increased work loads. Also specific nodes can be 
targeted to specific hardware.
When a graph URI is encountered the proxy can interpret its structure. So 
urn:Iungo:sdai/foo/bar would cause the SDAI model bar in the SDAI repository 
foo to be dynamically loaded into memory along with the quads which are 
required to satisfy the find.
Typically a group of people will be working on a set of data so the first to 
query will load the dataset then it will be accessed multiple times. There will 
be an initial dynamic load of data which will tail off with some additional 
loading over time.

Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-04 Thread Dick Murray

Hope that clears up the "writes are not supported" (the underlying data is
read only) and why the TupleTable subtypes are not problematic. I could
have held the created triples per find triple but that wouldn't scale with
duplication plus why recreate the wheel when if I'm not mistaken TriTable
uses the dexx collection giving subsequent HAMT advantages which is what a
high performance in memory implementation requires. The solution is working
and compared to a fully transformed TDB is giving the correct results. To
do might include timing out the G when they have not been accessed for a
period of time...

Finally, having written the wrapper I thought it wouldn't be used anywhere
else, but subsequently it was used to abstract an existing system where
ad hoc semantic access was required, and it's lined up to do a similar task on
two other data silos. Hence the question to Andy regarding a Jena cached
SPI package.

Thanks again for your help Adam/Andy.

Dick.



On 4 March 2016 at 01:36, A. Soroka <aj...@virginia.edu> wrote:

> I’m confused about two of your points here. Let me separate them out so we
> can discuss them easily.
>
> 1) "writes are not supported”:
>
> Writes are certainly supported in the Graph/DatasetGraph SPI. Graph::add
> and ::delete, DatasetGraph::add, ::delete, ::deleteAny… after all, Graph
> and DatasetGraph are the basic abstractions implemented by Jena’s own
> out-of-the-box implementations of RDF storage. Can you explain what you
> mean by this?
>
> 2) "methods which call find(ANY, ANY, ANY) play havoc with an on demand
> triple caching algorithm”:
>
> The subtypes of TupleTable with which you are working have exactly the
> same kinds of find() methods. Why are they not problematic in that context?
>
> ---
> A. Soroka
> The University of Virginia Library
>
> > On Mar 3, 2016, at 5:47 AM, Joint <dandh...@gmail.com> wrote:
> >
> >
> >
> > Hi Andy.
> > I implemented the entire SPI at the DatasetGraph and Graph level. It got
> to the point where I had overridden more methods than not. In addition
> writes are not supported, and it contains methods that call find(ANY, ANY,
> ANY), which play havoc with an on-demand triple caching algorithm! ;-) I'm using
> the TriTable because it fits and quads are spoofed via triple to quad
> iterator.
> > I have a set of filters and handles which the find triple is compared
> against and either passed straight to the TriTable if the triple has been
> handled before or it's passed to the appropriate handle which adds the
> triples to the TriTable then calls the find. As the underlying data is a
> tree a cache depth can be set which allows related triples to be cached.
> Also the cache can be preloaded with common triples e.g. ANY RDF:type ?.
> > Would you consider a generic version for the Jena code base?
> >
> >
> > Dick
> >
> > ---- Original message 
> > From: Andy Seaborne <a...@apache.org>
> > Date: 18/02/2016  6:31 pm  (GMT+00:00)
> > To: users@jena.apache.org
> > Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
> >  DatasetGraphInMemory
> >
> > Hi,
> >
> > I'm not seeing how tapping into the implementation of
> > DatasetGraphInMemory is going to help (through the details
> >
> > As well as the DatasetGraphMap approach, one other thought that occurred
> > to me is to have a dataset (DatasetGraphWrapper) over any DatasetGraph
> > implementation.
> >
> > It loads, and clears, the mapped graph on-demand, and passes the find()
> > call through to the now-setup data.
> >
> >   Andy
> >
> > On 16/02/16 17:42, A. Soroka wrote:
> >>> Based on your description the DatasetGraphInMemory would seem to match
> the dynamic load requirement. How did you foresee it being loaded? Is there
> a large over head to using the add methods?
> >>
> >> No, I certainly did not mean to give that impression, and I don’t think
> it is entirely accurate. DSGInMemory was definitely not at all meant for
> dynamic loading. That doesn’t mean it can’t be used that way, but that was
> not in the design, which assumed that all tuples take about the same amount
> of time to access and that all of the same type are coming from the same
> implementation (in a QuadTable and a TripleTable).
> >>
> >> The overhead of mutating a dataset is mostly inside the implementations
> of TupleTable that are actually used to store tuples. You should be aware
> that TupleTable extends TransactionalComponent, so if you want to use it to
> create some kind of connection to your storage, you will need to make that
> connection fully transactional. That doesn’t sound at all trivial 

Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-03 Thread A. Soroka
I’m confused about two of your points here. Let me separate them out so we can 
discuss them easily.

1) "writes are not supported”:

Writes are certainly supported in the Graph/DatasetGraph SPI. Graph::add and 
::delete, DatasetGraph::add, ::delete, ::deleteAny… after all, Graph and 
DatasetGraph are the basic abstractions implemented by Jena’s own 
out-of-the-box implementations of RDF storage. Can you explain what you mean by 
this?

2) "methods which call find(ANY, ANY, ANY) play havoc with an on demand triple 
caching algorithm”:

The subtypes of TupleTable with which you are working have exactly the same 
kinds of find() methods. Why are they not problematic in that context?

---
A. Soroka
The University of Virginia Library

> On Mar 3, 2016, at 5:47 AM, Joint <dandh...@gmail.com> wrote:
> 
> 
> 
> Hi Andy.
> I implemented the entire SPI at the DatasetGraph and Graph level. It got to 
> the point where I had overridden more methods than not. In addition writes 
> are not supported and contains methods which call find(ANY, ANY, ANY) play 
> havoc with an on demand triple caching algorithm! ;-) I'm using the TriTable 
> because it fits and quads are spoofed via triple to quad iterator.
> I have a set of filters and handles which the find triple is compared against 
> and either passed straight to the TriTable if the triple has been handled 
> before or its passed to the appropriate handle which adds the triples to the 
> TriTable then calls the find. As the underlying data is a tree a cache depth 
> can be set which allows related triples to be cached. Also the cache can be 
> preloaded with common triples e.g. ANY RDF:type ?.
> Would you consider a generic version for the Jena code base?
> 
> 
> Dick
> 
>  Original message 
> From: Andy Seaborne <a...@apache.org> 
> Date: 18/02/2016  6:31 pm  (GMT+00:00) 
> To: users@jena.apache.org 
> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
>  DatasetGraphInMemory 
> 
> Hi,
> 
> I'm not seeing how tapping into the implementation of 
> DatasetGraphInMemory is going to help (through the details
> 
> As well as the DatasetGraphMap approach, one other thought that occurred 
> to me is to have a dataset (DatasetGraphWrapper) over any DatasetGraph 
> implementation.
> 
> It loads, and clears, the mapped graph on-demand, and passes the find() 
> call through to the now-setup data.
> 
>   Andy
> 
> On 16/02/16 17:42, A. Soroka wrote:
>>> Based on your description the DatasetGraphInMemory would seem to match the 
>>> dynamic load requirement. How did you foresee it being loaded? Is there a 
>>> large over head to using the add methods?
>> 
>> No, I certainly did not mean to give that impression, and I don’t think it 
>> is entirely accurate. DSGInMemory was definitely not at all meant for 
>> dynamic loading. That doesn’t mean it can’t be used that way, but that was 
>> not in the design, which assumed that all tuples take about the same amount 
>> of time to access and that all of the same type are coming from the same 
>> implementation (in a QuadTable and a TripleTable).
>> 
>> The overhead of mutating a dataset is mostly inside the implementations of 
>> TupleTable that are actually used to store tuples. You should be aware that 
>> TupleTable extends TransactionalComponent, so if you want to use it to 
>> create some kind of connection to your storage, you will need to make that 
>> connection fully transactional. That doesn’t sound at all trivial in your 
>> case.
>> 
>> At this point it seems to me that extending DatasetGraphMap (and 
>> implementing GraphMaker and Graph instead of TupleTable) might be a more 
>> appropriate design for your work. You can put dynamic loading behavior in 
>> Graph (or a GraphView subtype) just as easily as in TupleTable subtypes. Are 
>> there reasons around the use of transactionality in your work that demand 
>> the particular semantics supported by DSGInMemory?
>> 
>> ---
>> A. Soroka
>> The University of Virginia Library
>> 
>>> On Feb 13, 2016, at 5:18 AM, Joint <dandh...@gmail.com> wrote:
>>> 
>>> 
>>> 
>>> Hi.
>>> The quick full scenario is a distributed DaaS which supports queries, 
>>> updates, transforms and bulkloads. Andy Seaborne knows some of the detail 
>>> because I spoke to him previously. We achieve multiple writes by having 
>>> parallel Datasets, both traditional TDB and on demand in memory. Writes are 
>>> sent to a free dataset, free being not in a write transaction. That's a 
>>> simplistic overview...
>>> Queries are handl

Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-03-03 Thread Joint


Hi Andy.
I implemented the entire SPI at the DatasetGraph and Graph level. It got to the 
point where I had overridden more methods than not. In addition writes are not 
supported and contains methods which call find(ANY, ANY, ANY) play havoc with 
an on-demand triple caching algorithm! ;-) I'm using the TriTable because it 
fits, and quads are spoofed via a triple-to-quad iterator.
I have a set of filters and handlers against which the find triple is compared: 
it is either passed straight to the TriTable, if the triple has been handled 
before, or passed to the appropriate handler, which adds the triples to the 
TriTable and then calls the find. As the underlying data is a tree, a cache depth 
can be set which allows related triples to be cached. Also the cache can be 
preloaded with common triples e.g. ANY RDF:type ?.
Would you consider a generic version for the Jena code base?


Dick
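
The scheme described here — compare the incoming find pattern against a set of 
already-handled patterns, mint triples into the backing table on a miss, then 
delegate the find — can be sketched without Jena at all. Every type below 
(Triple, Table, OnDemandTable) is a simplified, hypothetical stand-in for the 
real TriTable machinery, not Jena's API:

```java
import java.util.*;
import java.util.function.Function;

public class CachingFindSketch {
    static final String ANY = "*"; // wildcard slot, standing in for Node.ANY

    record Triple(String s, String p, String o) {
        boolean matches(Triple pat) {
            return slot(pat.s, s) && slot(pat.p, p) && slot(pat.o, o);
        }
        static boolean slot(String pat, String val) {
            return pat.equals(ANY) || pat.equals(val);
        }
    }

    // Backing store playing the role of TriTable.
    static class Table {
        final List<Triple> triples = new ArrayList<>();
        void add(Triple t) { triples.add(t); }
        List<Triple> find(Triple pat) {
            return triples.stream().filter(t -> t.matches(pat)).toList();
        }
    }

    // On a cache miss the handler mints triples for the pattern; they are
    // added to the table and the pattern is remembered, so later finds go
    // straight to the table (the "cache against G ANY ANY ANY" idea).
    static class OnDemandTable {
        final Table table = new Table();
        final Set<Triple> handled = new HashSet<>();
        final Function<Triple, List<Triple>> handler;
        int handlerCalls = 0;

        OnDemandTable(Function<Triple, List<Triple>> handler) {
            this.handler = handler;
        }

        List<Triple> find(Triple pat) {
            if (handled.add(pat)) {          // first time this pattern is seen
                handlerCalls++;
                handler.apply(pat).forEach(table::add);
            }
            return table.find(pat);          // delegate to the backing table
        }
    }

    public static void main(String[] args) {
        OnDemandTable t = new OnDemandTable(pat ->
            List.of(new Triple("s1", "p1", "o1"), new Triple("s1", "p1", "o2")));
        System.out.println(t.find(new Triple("s1", "p1", ANY)).size()); // 2
        t.find(new Triple("s1", "p1", ANY));
        System.out.println(t.handlerCalls); // 1: the second find hit the cache
    }
}
```

This also shows why contains-style find(ANY, ANY, ANY) calls are awkward for 
such a design: the fully open pattern forces the handler to materialize 
everything on the first miss.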

 Original message 
From: Andy Seaborne <a...@apache.org> 
Date: 18/02/2016  6:31 pm  (GMT+00:00) 
To: users@jena.apache.org 
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
  DatasetGraphInMemory 

Hi,

I'm not seeing how tapping into the implementation of 
DatasetGraphInMemory is going to help (through the details

As well as the DatasetGraphMap approach, one other thought that occurred 
to me is to have a dataset (DatasetGraphWrapper) over any DatasetGraph 
implementation.

It loads, and clears, the mapped graph on-demand, and passes the find() 
call through to the now-setup data.

Andy

On 16/02/16 17:42, A. Soroka wrote:
>> Based on your description the DatasetGraphInMemory would seem to match the 
>> dynamic load requirement. How did you foresee it being loaded? Is there a 
>> large over head to using the add methods?
>
> No, I certainly did not mean to give that impression, and I don’t think it is 
> entirely accurate. DSGInMemory was definitely not at all meant for dynamic 
> loading. That doesn’t mean it can’t be used that way, but that was not in the 
> design, which assumed that all tuples take about the same amount of time to 
> access and that all of the same type are coming from the same implementation 
> (in a QuadTable and a TripleTable).
>
> The overhead of mutating a dataset is mostly inside the implementations of 
> TupleTable that are actually used to store tuples. You should be aware that 
> TupleTable extends TransactionalComponent, so if you want to use it to create 
> some kind of connection to your storage, you will need to make that 
> connection fully transactional. That doesn’t sound at all trivial in your 
> case.
>
> At this point it seems to me that extending DatasetGraphMap (and implementing 
> GraphMaker and Graph instead of TupleTable) might be a more appropriate 
> design for your work. You can put dynamic loading behavior in Graph (or a 
> GraphView subtype) just as easily as in TupleTable subtypes. Are there 
> reasons around the use of transactionality in your work that demand the 
> particular semantics supported by DSGInMemory?
>
> ---
> A. Soroka
> The University of Virginia Library
>
>> On Feb 13, 2016, at 5:18 AM, Joint <dandh...@gmail.com> wrote:
>>
>>
>>
>> Hi.
>> The quick full scenario is a distributed DaaS which supports queries, 
>> updates, transforms and bulkloads. Andy Seaborne knows some of the detail 
>> because I spoke to him previously. We achieve multiple writes by having 
>> parallel Datasets, both traditional TDB and on demand in memory. Writes are 
>> sent to a free dataset, free being not in a write transaction. That's a 
>> simplistic overview...
>> Queries are handled by a dataset proxy which builds a dynamic dataset based 
>> on the graph URIs. For example the graph URI urn:Iungo:all causes the proxy 
>> find method to issue the query to all known Datasets and return the union of 
>> results. Various dataset proxies exist, some load TDBs, others load TTL 
>> files into graphs, others dynamically create tuples. The common thing being 
>> they are all presented as Datasets backed by DatasetGraph. Thus a SPARQL 
>> query can result in multiple Datasets being loaded to satisfy the query.
>> Nodes can be preloaded which then load Datasets to satisfy finds. This way 
>> the system can be scaled to handle increased work loads. Also specific nodes 
>> can be targeted to specific hardware.
>> When a graph URI is encountered the proxy can interpret it's structure. So 
>> urn:Iungo:sdai/foo/bar would cause the SDAI model bar in the SDAI repository 
>> foo to be dynamically loaded into memory along with the quads which are 
>> required to satisfy the find.
>> Typically a group of people will be working on a set of data so the first to 
>> query will load the dataset 

Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-02-18 Thread Andy Seaborne

Hi,

I'm not seeing how tapping into the implementation of 
DatasetGraphInMemory is going to help (though the details


As well as the DatasetGraphMap approach, one other thought that occurred 
to me is to have a dataset (DatasetGraphWrapper) over any DatasetGraph 
implementation.


It loads, and clears, the mapped graph on-demand, and passes the find() 
call through to the now-setup data.


Andy
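
The wrapper idea can be illustrated with a Jena-free sketch: graph names are 
registered with loaders, the mapped data is populated the first time find() 
touches it, and a loaded graph can be cleared so the next find() reloads it. 
All names here (Wrapper, Quad, register) are illustrative assumptions, not 
DatasetGraphWrapper's actual API:

```java
import java.util.*;
import java.util.function.Supplier;

public class OnDemandWrapperSketch {
    record Quad(String g, String s, String p, String o) {}

    static class Wrapper {
        final Map<String, List<Quad>> store = new HashMap<>();
        final Map<String, Supplier<List<Quad>>> loaders = new HashMap<>();

        void register(String graph, Supplier<List<Quad>> loader) {
            loaders.put(graph, loader);
        }

        // Load the graph the first time it is asked for, then pass the
        // find through to the now-set-up data.
        List<Quad> find(String graph) {
            return store.computeIfAbsent(graph,
                g -> loaders.getOrDefault(g, List::of).get());
        }

        // Drop a loaded graph; the next find() reloads it from its loader.
        void clear(String graph) { store.remove(graph); }
    }

    public static void main(String[] args) {
        Wrapper w = new Wrapper();
        w.register("urn:example:g",
            () -> List.of(new Quad("urn:example:g", "s", "p", "o")));
        System.out.println(w.find("urn:example:g").size()); // 1
    }
}
```

The attraction of this shape is that it works over any DatasetGraph 
implementation underneath, rather than committing to DSGInMemory's tables.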

On 16/02/16 17:42, A. Soroka wrote:

Based on your description the DatasetGraphInMemory would seem to match the 
dynamic load requirement. How did you foresee it being loaded? Is there a large 
over head to using the add methods?


No, I certainly did not mean to give that impression, and I don’t think it is 
entirely accurate. DSGInMemory was definitely not at all meant for dynamic 
loading. That doesn’t mean it can’t be used that way, but that was not in the 
design, which assumed that all tuples take about the same amount of time to 
access and that all of the same type are coming from the same implementation 
(in a QuadTable and a TripleTable).

The overhead of mutating a dataset is mostly inside the implementations of 
TupleTable that are actually used to store tuples. You should be aware that 
TupleTable extends TransactionalComponent, so if you want to use it to create 
some kind of connection to your storage, you will need to make that connection 
fully transactional. That doesn’t sound at all trivial in your case.

At this point it seems to me that extending DatasetGraphMap (and implementing 
GraphMaker and Graph instead of TupleTable) might be a more appropriate design 
for your work. You can put dynamic loading behavior in Graph (or a GraphView 
subtype) just as easily as in TupleTable subtypes. Are there reasons around the 
use of transactionality in your work that demand the particular semantics 
supported by DSGInMemory?

---
A. Soroka
The University of Virginia Library


On Feb 13, 2016, at 5:18 AM, Joint <dandh...@gmail.com> wrote:



Hi.
The quick full scenario is a distributed DaaS which supports queries, updates, 
transforms and bulkloads. Andy Seaborne knows some of the detail because I 
spoke to him previously. We achieve multiple writes by having parallel 
Datasets, both traditional TDB and on demand in memory. Writes are sent to a 
free dataset, free being not in a write transaction. That's a simplistic 
overview...
Queries are handled by a dataset proxy which builds a dynamic dataset based on 
the graph URIs. For example the graph URI urn:Iungo:all causes the proxy find 
method to issue the query to all known Datasets and return the union of 
results. Various dataset proxies exist, some load TDBs, others load TTL files 
into graphs, others dynamically create tuples. The common thing being they are 
all presented as Datasets backed by DatasetGraph. Thus a SPARQL query can 
result in multiple Datasets being loaded to satisfy the query.
Nodes can be preloaded which then load Datasets to satisfy finds. This way the 
system can be scaled to handle increased work loads. Also specific nodes can be 
targeted to specific hardware.
When a graph URI is encountered the proxy can interpret it's structure. So 
urn:Iungo:sdai/foo/bar would cause the SDAI model bar in the SDAI repository 
foo to be dynamically loaded into memory along with the quads which are 
required to satisfy the find.
Typically a group of people will be working on a set of data so the first to 
query will load the dataset then it will be accessed multiple times. There will 
be an initial dynamic load of data which will tail off with some additional 
loading over time.
Based on your description the DatasetGraphInMemory would seem to match the 
dynamic load requirement. How did you foresee it being loaded? Is there a large 
over head to using the add methods?
A typical scenario would be to search all SDAI repository's for some key 
information then load detailed information in some, continuing to drill down.
Hope this helps.
I'm going to extend the hex and tri tables and run some tests. I've already 
shimed the DGTriplesQuads so the actual caching code already exists and should 
bed easy to hook on.
Dick

 Original message 
From: "A. Soroka" <aj...@virginia.edu>
Date: 12/02/2016  11:07 pm  (GMT+00:00)
To: users@jena.apache.org
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using 
DatasetGraphInMemory

Okay, I’m more confident at this point that you’re not well served by 
DatasetGraphInMemory, which has very strong assumptions about the speedy 
reachability of data. DSGInMemory was built for situations when all of the data 
is in core memory and multithreaded access is important. If you have a lot of 
core memory and can load the data fully, you might want to use it, but that 
doesn’t sound at all like your case. Otherwise, as far as what the right 
extension point is, I will need to defer to committers or more experienced 
devs, but I think you may need to look at Da

Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-02-16 Thread A. Soroka
> Based on your description the DatasetGraphInMemory would seem to match the 
> dynamic load requirement. How did you foresee it being loaded? Is there a 
> large over head to using the add methods?

No, I certainly did not mean to give that impression, and I don’t think it is 
entirely accurate. DSGInMemory was definitely not at all meant for dynamic 
loading. That doesn’t mean it can’t be used that way, but that was not in the 
design, which assumed that all tuples take about the same amount of time to 
access and that all of the same type are coming from the same implementation 
(in a QuadTable and a TripleTable).

The overhead of mutating a dataset is mostly inside the implementations of 
TupleTable that are actually used to store tuples. You should be aware that 
TupleTable extends TransactionalComponent, so if you want to use it to create 
some kind of connection to your storage, you will need to make that connection 
fully transactional. That doesn’t sound at all trivial in your case.

At this point it seems to me that extending DatasetGraphMap (and implementing 
GraphMaker and Graph instead of TupleTable) might be a more appropriate design 
for your work. You can put dynamic loading behavior in Graph (or a GraphView 
subtype) just as easily as in TupleTable subtypes. Are there reasons around the 
use of transactionality in your work that demand the particular semantics 
supported by DSGInMemory?

---
A. Soroka
The University of Virginia Library
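
The DatasetGraphMap/GraphMaker route suggested here can be sketched as 
follows: a maker interprets the structure of a graph name on first dereference 
(the thread's urn:Iungo:sdai/foo/bar idea) and builds that graph on demand, 
after which the dataset reuses it. This is a hypothetical, Jena-free 
illustration; the urn:example scheme and every type name below are assumptions:

```java
import java.util.*;
import java.util.function.Function;

public class GraphMakerSketch {
    record Triple(String s, String p, String o) {}

    interface Graph { List<Triple> find(); }

    static class Dataset {
        final Map<String, Graph> graphs = new HashMap<>();
        final Function<String, Graph> maker; // plays the GraphMaker role

        Dataset(Function<String, Graph> maker) { this.maker = maker; }

        // Each graph is made once, on first dereference, then cached.
        Graph getGraph(String name) {
            return graphs.computeIfAbsent(name, maker);
        }
    }

    public static void main(String[] args) {
        // A maker that mangles the graph URI into (repository, model) and
        // pretends to load that model's triples.
        Function<String, Graph> maker = name -> {
            String[] parts = name.substring("urn:example:sdai/".length()).split("/");
            String repo = parts[0], model = parts[1];
            List<Triple> loaded = List.of(new Triple(
                "urn:example:" + repo + "/" + model, "rdf:type", "sdai:Model"));
            return () -> loaded;
        };
        Dataset ds = new Dataset(maker);
        System.out.println(ds.getGraph("urn:example:sdai/foo/bar").find().size()); // 1
    }
}
```

Putting the dynamic-load behaviour in the Graph (or a GraphView subtype) this 
way sidesteps TupleTable's transactional obligations entirely.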

> On Feb 13, 2016, at 5:18 AM, Joint <dandh...@gmail.com> wrote:
> 
> 
> 
> Hi.
> The quick full scenario is a distributed DaaS which supports queries, 
> updates, transforms and bulkloads. Andy Seaborne knows some of the detail 
> because I spoke to him previously. We achieve multiple writes by having 
> parallel Datasets, both traditional TDB and on demand in memory. Writes are 
> sent to a free dataset, free being not in a write transaction. That's a 
> simplistic overview...
> Queries are handled by a dataset proxy which builds a dynamic dataset based 
> on the graph URIs. For example the graph URI urn:Iungo:all causes the proxy 
> find method to issue the query to all known Datasets and return the union of 
> results. Various dataset proxies exist, some load TDBs, others load TTL files 
> into graphs, others dynamically create tuples. The common thing being they 
> are all presented as Datasets backed by DatasetGraph. Thus a SPARQL query can 
> result in multiple Datasets being loaded to satisfy the query.
> Nodes can be preloaded which then load Datasets to satisfy finds. This way 
> the system can be scaled to handle increased work loads. Also specific nodes 
> can be targeted to specific hardware.
> When a graph URI is encountered the proxy can interpret it's structure. So 
> urn:Iungo:sdai/foo/bar would cause the SDAI model bar in the SDAI repository 
> foo to be dynamically loaded into memory along with the quads which are 
> required to satisfy the find.
> Typically a group of people will be working on a set of data so the first to 
> query will load the dataset then it will be accessed multiple times. There 
> will be an initial dynamic load of data which will tail off with some 
> additional loading over time.
> Based on your description the DatasetGraphInMemory would seem to match the 
> dynamic load requirement. How did you foresee it being loaded? Is there a 
> large over head to using the add methods?
> A typical scenario would be to search all SDAI repository's for some key 
> information then load detailed information in some, continuing to drill down.
> Hope this helps.
> I'm going to extend the hex and tri tables and run some tests. I've already 
> shimed the DGTriplesQuads so the actual caching code already exists and 
> should bed easy to hook on.
> Dick
> 
>  Original message ----
> From: "A. Soroka" <aj...@virginia.edu> 
> Date: 12/02/2016  11:07 pm  (GMT+00:00) 
> To: users@jena.apache.org 
> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using 
> DatasetGraphInMemory 
> 
> Okay, I’m more confident at this point that you’re not well served by 
> DatasetGraphInMemory, which has very strong assumptions about the speedy 
> reachability of data. DSGInMemory was built for situations when all of the 
> data is in core memory and multithreaded access is important. If you have a 
> lot of core memory and can load the data fully, you might want to use it, but 
> that doesn’t sound at all like your case. Otherwise, as far as what the right 
> extension point is, I will need to defer to committers or more experienced 
> devs, but I think you may need to look at DatasetGraph from a more 
> close-to-the-metal point. TDB extends DatasetGraphTriplesQuads directly, for 
> example.
> 
> Can you tell us a bit mor

Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-02-13 Thread Joint


Hi.
The quick full scenario is a distributed DaaS which supports queries, updates, 
transforms and bulkloads. Andy Seaborne knows some of the detail because I 
spoke to him previously. We achieve multiple writes by having parallel 
Datasets, both traditional TDB and on demand in memory. Writes are sent to a 
free dataset, free being not in a write transaction. That's a simplistic 
overview...
Queries are handled by a dataset proxy which builds a dynamic dataset based on 
the graph URIs. For example the graph URI urn:Iungo:all causes the proxy find 
method to issue the query to all known Datasets and return the union of 
results. Various dataset proxies exist, some load TDBs, others load TTL files 
into graphs, others dynamically create tuples. The common thing being they are 
all presented as Datasets backed by DatasetGraph. Thus a SPARQL query can 
result in multiple Datasets being loaded to satisfy the query.
Nodes can be preloaded which then load Datasets to satisfy finds. This way the 
system can be scaled to handle increased workloads. Also specific nodes can be 
targeted to specific hardware.
When a graph URI is encountered, the proxy can interpret its structure. So 
urn:Iungo:sdai/foo/bar would cause the SDAI model bar in the SDAI repository 
foo to be dynamically loaded into memory along with the quads which are 
required to satisfy the find.
Typically a group of people will be working on a set of data so the first to 
query will load the dataset then it will be accessed multiple times. There will 
be an initial dynamic load of data which will tail off with some additional 
loading over time.
Based on your description the DatasetGraphInMemory would seem to match the 
dynamic load requirement. How did you foresee it being loaded? Is there a large 
overhead to using the add methods?
A typical scenario would be to search all SDAI repositories for some key 
information then load detailed information in some, continuing to drill down.
Hope this helps.
I'm going to extend the hex and tri tables and run some tests. I've already 
shimmed the DGTriplesQuads, so the actual caching code already exists and 
should be easy to hook on.
Dick

 Original message 
From: "A. Soroka" <aj...@virginia.edu> 
Date: 12/02/2016  11:07 pm  (GMT+00:00) 
To: users@jena.apache.org 
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using 
DatasetGraphInMemory 

Okay, I’m more confident at this point that you’re not well served by 
DatasetGraphInMemory, which has very strong assumptions about the speedy 
reachability of data. DSGInMemory was built for situations when all of the data 
is in core memory and multithreaded access is important. If you have a lot of 
core memory and can load the data fully, you might want to use it, but that 
doesn’t sound at all like your case. Otherwise, as far as what the right 
extension point is, I will need to defer to committers or more experienced 
devs, but I think you may need to look at DatasetGraph from a more 
close-to-the-metal point. TDB extends DatasetGraphTriplesQuads directly, for 
example.

Can you tell us a bit more about your full scenario? I don’t know much about 
STEP (sorry if others do)— is there a canonical RDF formulation? What kinds of 
queries are you going to be using with this data? How quickly are users going 
to need to switch contexts between datasets?

---
A. Soroka
The University of Virginia Library

> On Feb 12, 2016, at 2:44 PM, Joint <dandh...@gmail.com> wrote:
> 
> 
> 
> Thanks for the fast response!
>  I have a set of disk based binary SDAI repository's which are based on 
>ISO10303 parts 11/21/25/27 otherwise known as the EXPRESS/STEP/SDAI parts. In 
>particular my files are IFC2x3 files which can be +1Gb. However after 
>processing into a SDAI binary I typically see a size reduction e.g. 1.4Gb STEP 
>file becomes a 1Gb SDAI repository. If I convert the STEP file into TDB I get 
>+100M quads and a 50Gb folder. Multiplied by 1000's of similar sized STEP 
>files...
> Typically only a small subset of the STEP file needs to be queried but 
> sometimes other parts need to be queried. Hence the on demand caching and 
> DatasetGraphInMemory. The aim is that in the find methods I check a cache and 
> call the native SDAI find methods based on the node URI's in the case of a 
> cache miss, calling the add methods for the minted tuples, then passing on 
> the call to the super find. The underlying SDAI repository's are static so 
> once a subject is cached no other work is required.
> As the DatasetGraphInMemory is commented as very fast quad and triple access 
> it seemed a logical place to extend. The shim cache would be set to expire 
> entries and limit the total number of tuples power repository. This is 
> currently deployed on a 256Gb ram device.
> In the bigger picture l have a service very similar to Fuseki which allows 
>

Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-02-12 Thread A. Soroka
I wrote the DatasetGraphInMemory code, but I suspect your question may be 
better answered by other folks who are more familiar with Jena's DatasetGraph 
implementations, or may actually not have anything to do with DatasetGraph (see 
below for why). I will try to give some background information, though.

There are several paths by which DatasetGraphInMemory can perform finds, but 
they come down to two places in the code, QuadTable::find and 
TripleTable::find, and in default operation, the concrete forms:

https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapQuadTable.java#L100

for Quads and

https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapTripleTable.java#L99

for Triples. Those methods are reused by all the differently-ordered indexes 
within Hex- or TriTable, each of which will answer a find by selecting an 
appropriately-ordered index based on the fixed and variable slots in the find 
pattern and using the concrete methods above to stream tuples back.

As to why you are seeing your methods called in some places and not in others, 
DatasetGraphBaseFind features methods like findInDftGraph(), 
findInSpecificNamedGraph(), findInAnyNamedGraphs(), etc., and these are the 
methods that DatasetGraphInMemory is implementing. DSGInMemory does not make a 
selection between those methods— that is done by DatasetGraphBaseFind. So that 
is where you will find the logic that should answer your question.

Can you say a little more about your use case? You seem to have some efficient 
representation in memory of your data (I hope it is in-memory— otherwise it is 
a very bad choice to subclass DSGInMemory) and you want to create tuples on the 
fly as queries are received. That is really not at all what DSGInMemory is for 
(DSGInMemory is using map structures for indexing and in default mode, uses 
persistent data structures to support transactionality). I am wondering whether 
you might not be much better served by tapping into Jena at a different place, 
perhaps implementing the Graph SPI directly. Or, if reusing DSGInMemory is the 
right choice, just implementing Quad- and TripleTable and using the constructor 
DatasetGraphInMemory(final QuadTable i, final TripleTable t).

---
A. Soroka
The University of Virginia Library
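
The index selection described above — answer a find from the ordering whose 
leading slots are the pattern's fixed slots, so that matches form one 
contiguous key range — can be sketched with plain TreeMaps standing in for the 
persistent-map indexes. Nothing below is Jena's actual code; it only 
illustrates the technique:

```java
import java.util.*;

public class IndexSelectionSketch {
    static final String ANY = "*"; // wildcard slot, standing in for Node.ANY

    record Triple(String s, String p, String o) {
        String slot(int i) { return i == 0 ? s : i == 1 ? p : o; }
    }

    // One ordering of the same triples, e.g. {0,1,2} = SPO, {1,2,0} = POS.
    static class Index {
        final int[] order;
        final NavigableMap<String, Triple> map = new TreeMap<>();
        Index(int... order) { this.order = order; }

        void add(Triple t) {
            map.put(t.slot(order[0]) + "\u0000" + t.slot(order[1])
                + "\u0000" + t.slot(order[2]), t);
        }
        // How many leading slots of the pattern are concrete in this ordering.
        int concretePrefix(Triple pat) {
            int n = 0;
            while (n < 3 && !pat.slot(order[n]).equals(ANY)) n++;
            return n;
        }
        Collection<Triple> scan(Triple pat) {
            StringBuilder prefix = new StringBuilder();
            for (int n = 0; n < concretePrefix(pat); n++)
                prefix.append(pat.slot(order[n])).append('\u0000');
            // all keys starting with the prefix form one contiguous range
            return map.subMap(prefix.toString(), prefix + "\uffff").values();
        }
    }

    static class Table {
        final List<Index> indexes = List.of(
            new Index(0, 1, 2), new Index(1, 2, 0), new Index(2, 0, 1));
        void add(Triple t) { indexes.forEach(i -> i.add(t)); }

        List<Triple> find(Triple pat) {
            // pick the ordering with the longest concrete prefix
            Index best = Collections.max(indexes,
                Comparator.comparingInt((Index i) -> i.concretePrefix(pat)));
            return best.scan(pat).stream()
                .filter(t -> matches(t, pat)).toList(); // residual filter
        }
        static boolean matches(Triple t, Triple pat) {
            for (int i = 0; i < 3; i++)
                if (!pat.slot(i).equals(ANY) && !pat.slot(i).equals(t.slot(i)))
                    return false;
            return true;
        }
    }

    public static void main(String[] args) {
        Table table = new Table();
        table.add(new Triple("s1", "p1", "o1"));
        table.add(new Triple("s2", "p1", "o2"));
        System.out.println(table.find(new Triple(ANY, "p1", ANY)).size()); // 2
    }
}
```

A find(ANY, "p1", ANY) here selects the POS-like ordering, since that is the 
one whose leading slot is fixed.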

> On Feb 12, 2016, at 12:58 PM, Dick Murray  wrote:
> 
> Hi.
> 
> Does anyone know the "find" paths through DatasetGraphInMemory please?
> 
> For example if I extend DatasetGraphInMemory and override
> DatasetGraphBaseFind.find(Node, Node, Node, Node) it breakpoints on "select
> * where {?s ?p ?o}" however if I override the other
> DatasetGraphBaseFind.find(...) methods, "select * where {graph ?g {?s ?p
> ?o}}" does not trigger a breakpoint i.e. I don't know what method it's
> calling (but as I type I'm guessing it's optimised to return the HexTable
> nodes...).
> 
> Would I be better off overriding HexTable and TriTable classes find methods
> when I create the DatasetGraphInMemory? Are all finds guaranteed to end in
> one of these methods?
> 
> I need to know the root find methods so that I can shim them to create
> triples/quads before they perform the find.
> 
> I need to create Triples/Quads on demand (because a bulk load would create
> ~100M triples but only ~1000 are ever queried) and the source binary form
> is more efficient (binary ~1GB native tree versus TDB ~50GB ~100M quads)
> than quads.
> 
> Regards Dick Murray.



Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-02-12 Thread Joint


Thanks for the fast response!
 I have a set of disk-based binary SDAI repositories which are based on 
ISO10303 parts 11/21/25/27 otherwise known as the EXPRESS/STEP/SDAI parts. In 
particular my files are IFC2x3 files which can be +1Gb. However after 
processing into a SDAI binary I typically see a size reduction e.g. 1.4Gb STEP 
file becomes a 1Gb SDAI repository. If I convert the STEP file into TDB I get 
+100M quads and a 50Gb folder. Multiplied by 1000's of similar sized STEP 
files...
Typically only a small subset of the STEP file needs to be queried but 
sometimes other parts need to be queried. Hence the on demand caching and 
DatasetGraphInMemory. The aim is that in the find methods I check a cache and 
call the native SDAI find methods based on the node URIs in the case of a 
cache miss, calling the add methods for the minted tuples, then passing on the 
call to the super find. The underlying SDAI repository's are static so once a 
subject is cached no other work is required.
As DatasetGraphInMemory is documented as providing very fast quad and triple access, it 
seemed a logical place to extend. The shim cache would be set to expire entries 
and limit the total number of tuples per repository. This is currently 
deployed on a 256Gb ram device.
In the bigger picture I have a service very similar to Fuseki which allows 
SPARQL requests to be made against Datasets which are either TDB or SDAI cache 
backed.
What was DatasetGraphInMemory created for..? ;-)
Dick

 Original message 
From: "A. Soroka" <aj...@virginia.edu> 
Date: 12/02/2016  6:21 pm  (GMT+00:00) 
To: users@jena.apache.org 
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using 
DatasetGraphInMemory 

I wrote the DatasetGraphInMemory  code, but I suspect your question may be 
better answered by other folks who are more familiar with Jena's DatasetGraph 
implementations, or may actually not have anything to do with DatasetGraph (see 
below for why). I will try to give some background information, though.

There are several paths by which where DatasetGraphInMemory can be performing 
finds, but they come down to two places in the code, QuadTable:: and 
TripleTable::find and in default operation, the concrete forms:

https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapQuadTable.java#L100

for Quads and

https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapTripleTable.java#L99

for Triples. Those methods are reused by all the differently-ordered indexes 
within Hex- or TriTable, each of which will answer a find by selecting an 
appropriately-ordered index based on the fixed and variable slots in the find 
pattern and using the concrete methods above to stream tuples back.

As to why you are seeing your methods called in some places and not in others: 
DatasetGraphBaseFind features methods like findInDftGraph(), 
findInSpecificNamedGraph(), findInAnyNamedGraphs(), etc., and these are the 
methods that DatasetGraphInMemory implements. DSGInMemory does not make the 
selection between those methods; that is done by DatasetGraphBaseFind. So that 
is where you will find the logic that should answer your question.
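
The dispatch that DatasetGraphBaseFind performs can be caricatured as follows. This is a deliberately simplified sketch: the graph slot is modelled as a plain string, "ANY" stands in for Node.ANY, and cases such as the union graph are ignored.

```java
// Rough sketch of find() dispatch in the style of DatasetGraphBaseFind
// (simplified; not the real logic). A null graph slot or the Jena default
// graph marker routes to the default graph, the wildcard routes to all
// named graphs, anything else to one specific named graph.
public class FindDispatch {
    enum Route { DEFAULT_GRAPH, ANY_NAMED_GRAPH, SPECIFIC_NAMED_GRAPH }

    static Route route(String g) {
        if (g == null || g.equals("urn:x-arq:DefaultGraph")) return Route.DEFAULT_GRAPH;
        if (g.equals("ANY")) return Route.ANY_NAMED_GRAPH;
        return Route.SPECIFIC_NAMED_GRAPH;
    }

    public static void main(String[] args) {
        System.out.println(route(null));                // DEFAULT_GRAPH
        System.out.println(route("ANY"));               // ANY_NAMED_GRAPH
        System.out.println(route("http://example/g1")); // SPECIFIC_NAMED_GRAPH
    }
}
```

This is why overriding only find(Node, Node, Node, Node) catches some queries but not `graph ?g {...}` patterns: the latter are routed to the findIn* methods before your override is ever consulted.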

Can you say a little more about your use case? You seem to have some efficient 
representation in memory of your data (I hope it is in-memory— otherwise it is 
a very bad choice to subclass DSGInMemory) and you want to create tuples on the 
fly as queries are received. That is really not at all what DSGInMemory is for 
(DSGInMemory is using map structures for indexing and in default mode, uses 
persistent data structures to support transactionality). I am wondering whether 
you might not be much better served by tapping into Jena at a different place, 
perhaps implementing the Graph SPI directly. Or, if reusing DSGInMemory is the 
right choice, just implementing Quad- and TripleTable and using the constructor 
DatasetGraphInMemory(final QuadTable i, final TripleTable t).
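
To see why the persistent (immutable) data structures mentioned above buy transactionality, here is a minimal copy-on-write sketch in plain Java: a reader pins an immutable snapshot, and a commit swaps in a fresh map atomically, so readers are isolated from concurrent writes. This shows the general flavour of the design only, not Jena's actual implementation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Copy-on-write snapshot store: readers hold a reference to an immutable
// map and never see later commits; writers copy, modify, and atomically
// swap. Persistent structures make the "copy" step cheap via sharing;
// here a full HashMap copy keeps the sketch simple.
public class SnapshotStore {
    private final AtomicReference<Map<String, String>> current =
        new AtomicReference<>(Map.of());

    Map<String, String> beginRead() { return current.get(); } // pin a snapshot

    void commit(String key, String value) {
        Map<String, String> next = new HashMap<>(current.get());
        next.put(key, value);
        current.set(Collections.unmodifiableMap(next));       // atomic swap
    }

    public static void main(String[] args) {
        SnapshotStore store = new SnapshotStore();
        store.commit("s", "o1");
        Map<String, String> reader = store.beginRead();  // read txn starts
        store.commit("s", "o2");                         // concurrent write
        System.out.println(reader.get("s"));             // still o1
        System.out.println(store.beginRead().get("s"));  // o2
    }
}
```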

---
A. Soroka
The University of Virginia Library

> On Feb 12, 2016, at 12:58 PM, Dick Murray <dandh...@gmail.com> wrote:
> 
> Hi.
> 
> Does anyone know the "find" paths through DatasetGraphInMemory please?
> 
> For example if I extend DatasetGraphInMemory and override
> DatasetGraphBaseFind.find(node, Node, Node, Node) it breakpoints on "select
> * where {?s ?p ?o}" however if I override the other
> DatasetGraphBaseFind.find(...) methods, "select * where {graph ?g {?s ?p
> ?o}}" does not trigger a breakpoint i.e. I don't know what method it's
> calling (but as I type I'm guessing it's optimised to return the HexTable
> nodes...).
> 
> Would I be better off overriding HexTable and TriTable classes find methods
> when I create the DatasetGraphInMemory? Are all finds guaranteed to end in
> one of these methods?
> 
> I need to know

Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

2016-02-12 Thread A. Soroka
Okay, I’m more confident at this point that you’re not well served by 
DatasetGraphInMemory, which has very strong assumptions about the speedy 
reachability of data. DSGInMemory was built for situations when all of the data 
is in core memory and multithreaded access is important. If you have a lot of 
core memory and can load the data fully, you might want to use it, but that 
doesn’t sound at all like your case. Otherwise, as far as what the right 
extension point is, I will need to defer to committers or more experienced 
devs, but I think you may need to look at DatasetGraph from a more 
close-to-the-metal standpoint. TDB extends DatasetGraphTriplesQuads directly, for 
example.

Can you tell us a bit more about your full scenario? I don’t know much about 
STEP (sorry if others do)— is there a canonical RDF formulation? What kinds of 
queries are you going to be using with this data? How quickly are users going 
to need to switch contexts between datasets?

---
A. Soroka
The University of Virginia Library

> On Feb 12, 2016, at 2:44 PM, Joint <dandh...@gmail.com> wrote:
> 
> 
> 
> Thanks for the fast response!
>  I have a set of disk-based binary SDAI repositories which are based on 
> ISO 10303 parts 11/21/25/27, otherwise known as the EXPRESS/STEP/SDAI parts. In 
> particular my files are IFC2x3 files which can be 1GB+. However after 
> processing into an SDAI binary I typically see a size reduction, e.g. a 1.4GB 
> STEP file becomes a 1GB SDAI repository. If I convert the STEP file into TDB 
> I get 100M+ quads and a 50GB folder. Multiplied by 1000s of similar-sized 
> I get +100M quads and a 50Gb folder. Multiplied by 1000's of similar sized 
> STEP files...
> Typically only a small subset of the STEP file needs to be queried but 
> sometimes other parts need to be queried. Hence the on-demand caching and 
> DatasetGraphInMemory. The aim is that in the find methods I check a cache and 
> call the native SDAI find methods based on the node URIs in the case of a 
> cache miss, calling the add methods for the minted tuples, then passing on 
> the call to the super find. The underlying SDAI repositories are static, so 
> once a subject is cached no other work is required.
> As the DatasetGraphInMemory is commented as very fast quad and triple access 
> it seemed a logical place to extend. The shim cache would be set to expire 
> entries and limit the total number of tuples per repository. This is 
> currently deployed on a 256GB RAM device.
> In the bigger picture I have a service very similar to Fuseki which allows 
> SPARQL requests to be made against Datasets which are either TDB or SDAI 
> cache backed.
> What was DatasetGraphInMemory created for..? ;-)
> Dick
> 
>  Original message 
> From: "A. Soroka" <aj...@virginia.edu> 
> Date: 12/02/2016  6:21 pm  (GMT+00:00) 
> To: users@jena.apache.org 
> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using 
> DatasetGraphInMemory 
> 
> I wrote the DatasetGraphInMemory  code, but I suspect your question may be 
> better answered by other folks who are more familiar with Jena's DatasetGraph 
> implementations, or may actually not have anything to do with DatasetGraph 
> (see below for why). I will try to give some background information, though.
> 
> There are several paths by which DatasetGraphInMemory can perform finds, 
> but they all come down to two places in the code, QuadTable::find and 
> TripleTable::find, and in default operation, the concrete forms:
> 
> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapQuadTable.java#L100
> 
> for Quads and
> 
> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapTripleTable.java#L99
> 
> for Triples. Those methods are reused by all the differently-ordered indexes 
> within Hex- or TriTable, each of which will answer a find by selecting an 
> appropriately-ordered index based on the fixed and variable slots in the find 
> pattern and using the concrete methods above to stream tuples back.
> 
> As to why you are seeing your methods called in some places and not in 
> others, DatasetGraphBaseFind features methods like findInDftGraph(), 
> findInSpecificNamedGraph(), findInAnyNamedGraphs(), etc., and these are 
> the methods that DatasetGraphInMemory implements. DSGInMemory does not 
> make the selection between those methods; that is done by DatasetGraphBaseFind. 
> So that is where you will find the logic that should answer your question.
> 
> Can you say a little more about your use case? You seem to have some 
> efficient representation in memory of your data (I hope it is in-memory— 
> otherwise it is a very bad choice to subclass DSGInMemory) and you want to 
> create tuples on the fly as queries are received. That is really not a