Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

Andy Seaborne Thu, 10 Mar 2016 04:08:58 -0800

Hi Dick,

Thanks for the details.

So TriTable is used as the internal implementation of a cachingread-only graph and you're using the loop form for GRAPH (and often theloop is one URI - i.e. directed to one part of the data). UsingTriTable is because it's a convenient triple storage for the use case.


The two interesting pieces to Jena:

1/ support for writing dynamic adapters

2/ a graph (DatasetGraph) implementation that more clearly has aninterface for storage.



On the latter: I've come across this before and sketched this interface.

It's nothing more than a first pass sketch. Is this the sort of thingthat might work for your use case? (a graph storage version with quadsover the top as a subcase):


interface StorageRDF {
    default void add(Triple triple) { .... }
    default void add(Quad quad)     { .... }

    default void delete(Triple triple)  { .... }
    default void delete(Quad quad)      { .... }

    void add(Node s, Node p, Node o) ;
    void add(Node g, Node s, Node p, Node o) ;

    void delete(Node s, Node p, Node o) ;
    void delete(Node g, Node s, Node p, Node o) ;

    /** Delete all triples matching a {@code find}-like pattern */
    void removeAll(Node s, Node p, Node o) ;
    /** Delete all quads matching a {@code find}-like pattern */
    void removeAll(Node g, Node s, Node p, Node o) ;

    // NB Quads
    Stream<Quad>   findDftGraph(Node s, Node p, Node o) ;
    Stream<Quad>   findUnionGraph(Node s, Node p, Node o) ;
    Stream<Quad>   find(Node g, Node s, Node p, Node o) ;
    // For findUnion.
    Stream<Quad>   findDistinct(Node g, Node s, Node p, Node o) ;

    // triples
    Stream<Triple> find(Node s, Node p, Node o) ;

//    default Stream<Triple> find(Node s, Node p, Node o) {
//        return findDftGraph(s,p,o).map(Quad::asTriple) ;
//    }

//    Iterator<Quad>   findUnionGraph(Node s, Node p, Node o) ;
//    Iterator<Quad>   find(Node g, Node s, Node p, Node o) ;


    // contains

    default boolean contains(Node s, Node p, Node o)
    { return find(s,p,o).findAny().isPresent() ; }
    default boolean contains(Node g, Node s, Node p, Node o)
    { return find(g,s,p,o).findAny().isPresent() ; }

    // Prefixes ??
}


https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/dsg2
also has the companion DatasetGraphStorage.

    Andy



On 04/03/16 12:03, Dick Murray wrote:

LOL. The perils of a succinct update with no detail!

I understand the Jena SPI supports read/writes via transactions and I also
know that the wrapper classes provide a best effort for some of the
overridden methods which do not always sit well when materializing triples.
For example DatasetGraphBase provides public boolean containsGraph(Node
graphNode) {return contains(graphNode, Node.ANY, Node.ANY, Node.ANY);}
which results in a call to DatasetGraphBaseFind public Iterator<Quad>
find(Node g, Node s, Node p, Node o) which might end up with something
being called in DatasetGraphInMemory depending on what has been extended
and overridden. This causes a problem for me because I shim the finds to
decide whether the triples have been materialized before calling the
overridden find. After extending DatasetGraphTriples and
DatasetGraphInMemory I realised that I had overridden most of the methods
so I stopped and implemented DatasetGraph and Transactional.

In my scenario the underlying data (a vendor agnostic format to get
AutoCAD, Bentley, etc to work together) is never changed so the
DatasetGraph need not support writes. Whilst we need to provide semantic
access to the these files they result in ~100M triples each if transformed,
there are 1000's of files, they can change multiple times per day and the
various disciplines typically only require a subset of triples.

That said in my DatasetGraph implementation if you call
begin(ReadWrite.WRITE) it throw a UOE. The same is true for the Graph
implementation in that it does not support external writes (throws UOE) but
does implement writes internally (via TriTable) because it needs to write
the materialized triples to answer the find.

So if we take

select ?s
where {graph <urn:iungo:iso/10303/22/repository/r/model/m> {?s a
<urn:iungo:iso/10303/11/schema/s/entity/e>}

Jena via the SPARQL query engine will perform the following abridged
process.

    - Jena begins a DG read transaction.
    - Jena calls DG find(<urn:iungo:iso/10303/22/repository/r/model/m>, ANY,
    a <urn:iungo:iso/10303/11/schema/s/entity/e>).
    - DG will;
       - check if the repository r has been loaded, i.e. matching the
       repository name URI spec fragment to a repository file on disk
and loading
       it into the SDAI session.
       - check if the model m has been loaded, i.e. matching the model name
       URI spec fragment to a repository model and loading it into the SDAI
       session.
          - If we have just loaded the SDAI model check if there is any pre
          caching to be done which is just a set of find triples which
are handled as
          per the normal find detailed following.
       - We now have a G which wraps the SDAI model and uses TriTable to
    hold materialized triples.
    - DG will now call G.find(ANY, a
    <urn:iungo:iso/10303/11/schema/s/entity/e>).
    - G will check the find triple against a set of already materialized
    find triples and if it misses;
       - G will search a set of triple handles which know how to materialize
       triples for a given find triple and if found;
          - G begins a TriTable write transaction and for {ANY, a
          <urn:iungo:iso/10303/11/schema/s/entity/e>} (i.e the DG & G
are READ but
          the G TriTable is WRITE);
             - Check the find triples again we might have been in a race for
             the find triple and lost...
             - Load the correct Java class for entity e which involves
             minting the FQCN using the schema s and entity e e.g.
ifc2x3 and ifcslab
             become org.jsdai.ifc2x3.ifcslab.
             - Use this to call the SDAI method findInstances(Class<?
             extends Entity> entityClass) which returns zero or more
SDAI entities from
             which we;
                - Query the ifc2x3 schema to list the explicit Entity
                attributes and for each we add a triple to TriTable e.g.
                ifcslab:ifcorganization =
                {<urn:iungo:iso/10303/21/repository/r/model/m/instance/100>
                
<urn:iungo:iso/10303/11/schema/ifc2x3/entity/ifcslab/attribute/ifcorganization>
<urn:iungo:iso/10303/21/repository/r/model/m/instance/1>}
                - In addition we add the triple
                {<urn:iungo:iso/10303/21/repository/r/model/m/instance/100> a
                <urn:iungo:iso/10303/11/schema/s/entity/e>}.
                - If we are creating linked triples (i.e. max depth > 1)
                then for each attribute which has a SDAI entity
instance value call the
                appropriate handle to create the triples.
             - G commits the TriTable write transaction (make the triples
          visible before we update the find triples!).
          - G updates the find triples to include;
          - {ANY, a <urn:iungo:iso/10303/11/schema/s/entity/e>}
             - {<urn:iungo:iso/10303/21/repository/r/model/m/instance/100>
             ANY ANY}
             - Repeat the above for any linked triples created.
             - The TriTable now contains the triples required to answer the
          find triple.
          - G will return TriTable.find(ANY, a
    <urn:iungo:iso/10303/11/schema/s/entity/e>)
    - Jena ends the DG read transaction.


Some find triples will result in the appropriate handle being called
(handle hit) which will create triples. Others will handle miss and be
passed on to the TriTable find (e.g. no triples created and TriTable will
return nothing). A few will result in a UOE {ANY, ANY, ANY} being an
example because does this mean create all of the triples (+100M) or all of
the currently created triples (which relies on having queried what you need
to ANY!). Currently we only UOE on {ANY ANY ANY} and is it really useful to
ask this find?

Hope that clear up the "writes are not supported" (the underlying data is
read only) and why the TupleTable subtypes are not problematic. I could
have held the created triples per find triple but that wouldn't scale with
duplication plus why recreate the wheel when if I'm not mistaken TriTable
uses the dexx collection giving subsequent HAMT advantages which is what a
high performance in memory implementation requires. The solution is working
and compared to a fully transformed TDB is giving the correct results. To
do might include timing out the G when they have not been accessed for a
period of time...

Finally having wrote the wrapper I thought it wouldn't be used anywhere
else but subsequently it was used to abstract an existing system where
adhoc semantic access was required and it's lined to do a similar task on
two other data silos. Hence the question to Andy regarding a Jena cached
SPI package.

Thanks again for your help Adam/Andy.

Dick.



On 4 March 2016 at 01:36, A. Soroka <aj...@virginia.edu> wrote:

I’m confused about two of your points here. Let me separate them out so we
can discuss them easily.

1) "writes are not supported”:

Writes are certainly supported in the Graph/DatasetGraph SPI. Graph::add
and ::delete, DatasetGraph::add, ::delete, ::deleteAny… after all, Graph
and DatasetGraph are the basic abstractions implemented by Jena’s own
out-of-the-box implementations of RDF storage. Can you explain what you
mean by this?

2) "methods which call find(ANY, ANY, ANY) play havoc with an on demand
triple caching algorithm”:

The subtypes of TupleTable with which you are working have exactly the
same kinds of find() methods. Why are they not problematic in that context?

---
A. Soroka
The University of Virginia Library

On Mar 3, 2016, at 5:47 AM, Joint <dandh...@gmail.com> wrote:



Hi Andy.
I implemented the entire SPI at the DatasetGraph and Graph level. It got

to the point where I had overridden more methods than not. In addition
writes are not supported and contains methods which call find(ANY, ANY,
ANY) play havoc with an on demand triple caching algorithm! ;-) I'm using
the TriTable because it fits and quads are spoofed via triple to quad
iterator.

I have a set of filters and handles which the find triple is compared

against and either passed straight to the TriTable if the triple has been
handled before or its passed to the appropriate handle which adds the
triples to the TriTable then calls the find. As the underlying data is a
tree a cache depth can be set which allows related triples to be cached.
Also the cache can be preloaded with common triples e.g. ANY RDF:type ?.

Would you consider a generic version for the Jena code base?

Dick

-------- Original message --------
From: Andy Seaborne <a...@apache.org>
Date: 18/02/2016  6:31 pm  (GMT+00:00)
To: users@jena.apache.org
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
  DatasetGraphInMemory

Hi,

I'm not seeing how tapping into the implementation of
DatasetGraphInMemory is going to help (through the details

As well as the DatasetGraphMap approach, one other thought that occurred
to me is to have a dataset (DatasetGraphWrapper) over any DatasetGraph
implementation.

It loads, and clears, the mapped graph on-demand, and passes the find()
call through to the now-setup data.

       Andy

On 16/02/16 17:42, A. Soroka wrote:

Based on your description the DatasetGraphInMemory would seem to match

the dynamic load requirement. How did you foresee it being loaded? Is there
a large over head to using the add methods?


No, I certainly did not mean to give that impression, and I don’t think

it is entirely accurate. DSGInMemory was definitely not at all meant for
dynamic loading. That doesn’t mean it can’t be used that way, but that was
not in the design, which assumed that all tuples take about the same amount
of time to access and that all of the same type are coming from the same
implementation (in a QuadTable and a TripleTable).


The overhead of mutating a dataset is mostly inside the implementations

of TupleTable that are actually used to store tuples. You should be aware
that TupleTable extends TransactionalComponent, so if you want to use it to
create some kind of connection to your storage, you will need to make that
connection fully transactional. That doesn’t sound at all trivial in your
case.


At this point it seems to me that extending DatasetGraphMap (and

implementing GraphMaker and Graph instead of TupleTable) might be a more
appropriate design for your work. You can put dynamic loading behavior in
Graph (or a GraphView subtype) just as easily as in TupleTable subtypes.
Are there reasons around the use of transactionality in your work that
demand the particular semantics supported by DSGInMemory?


---
A. Soroka
The University of Virginia Library

On Feb 13, 2016, at 5:18 AM, Joint <dandh...@gmail.com> wrote:



Hi.
The quick full scenario is a distributed DaaS which supports queries,

updates, transforms and bulkloads. Andy Seaborne knows some of the detail
because I spoke to him previously. We achieve multiple writes by having
parallel Datasets, both traditional TDB and on demand in memory. Writes are
sent to a free dataset, free being not in a write transaction. That's a
simplistic overview...

Queries are handled by a dataset proxy which builds a dynamic dataset

based on the graph URIs. For example the graph URI urn:Iungo:all causes the
proxy find method to issue the query to all known Datasets and return the
union of results. Various dataset proxies exist, some load TDBs, others
load TTL files into graphs, others dynamically create tuples. The common
thing being they are all presented as Datasets backed by DatasetGraph. Thus
a SPARQL query can result in multiple Datasets being loaded to satisfy the
query.

Nodes can be preloaded which then load Datasets to satisfy finds. This

way the system can be scaled to handle increased work loads. Also specific
nodes can be targeted to specific hardware.

When a graph URI is encountered the proxy can interpret it's

structure. So urn:Iungo:sdai/foo/bar would cause the SDAI model bar in the
SDAI repository foo to be dynamically loaded into memory along with the
quads which are required to satisfy the find.

Typically a group of people will be working on a set of data so the

first to query will load the dataset then it will be accessed multiple
times. There will be an initial dynamic load of data which will tail off
with some additional loading over time.

Based on your description the DatasetGraphInMemory would seem to match

the dynamic load requirement. How did you foresee it being loaded? Is there
a large over head to using the add methods?

A typical scenario would be to search all SDAI repository's for some

key information then load detailed information in some, continuing to drill
down.

Hope this helps.
I'm going to extend the hex and tri tables and run some tests. I've

already shimed the DGTriplesQuads so the actual caching code already exists
and should bed easy to hook on.

Dick

-------- Original message --------
From: "A. Soroka" <aj...@virginia.edu>
Date: 12/02/2016  11:07 pm  (GMT+00:00)
To: users@jena.apache.org
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using

DatasetGraphInMemory


Okay, I’m more confident at this point that you’re not well served by

DatasetGraphInMemory, which has very strong assumptions about the speedy
reachability of data. DSGInMemory was built for situations when all of the
data is in core memory and multithreaded access is important. If you have a
lot of core memory and can load the data fully, you might want to use it,
but that doesn’t sound at all like your case. Otherwise, as far as what the
right extension point is, I will need to defer to committers or more
experienced devs, but I think you may need to look at DatasetGraph from a
more close-to-the-metal point. TDB extends DatasetGraphTriplesQuads
directly, for example.


Can you tell us a bit more about your full scenario? I don’t know much

about STEP (sorry if others do)— is there a canonical RDF formulation? What
kinds of queries are you going to be using with this data? How quickly are
users going to need to switch contexts between datasets?


---
A. Soroka
The University of Virginia Library

On Feb 12, 2016, at 2:44 PM, Joint <dandh...@gmail.com> wrote:



Thanks for the fast response!
     I have a set of disk based binary SDAI repository's which are

based on ISO10303 parts 11/21/25/27 otherwise known as the
EXPRESS/STEP/SDAI parts. In particular my files are IFC2x3 files which can
be +1Gb. However after processing into a SDAI binary I typically see a size
reduction e.g. 1.4Gb STEP file becomes a 1Gb SDAI repository. If I convert
the STEP file into TDB I get +100M quads and a 50Gb folder. Multiplied by
1000's of similar sized STEP files...

Typically only a small subset of the STEP file needs to be queried

but sometimes other parts need to be queried. Hence the on demand caching
and DatasetGraphInMemory. The aim is that in the find methods I check a
cache and call the native SDAI find methods based on the node URI's in the
case of a cache miss, calling the add methods for the minted tuples, then
passing on the call to the super find. The underlying SDAI repository's are
static so once a subject is cached no other work is required.

As the DatasetGraphInMemory is commented as very fast quad and triple

access it seemed a logical place to extend. The shim cache would be set to
expire entries and limit the total number of tuples power repository. This
is currently deployed on a 256Gb ram device.

In the bigger picture l have a service very similar to Fuseki which

allows SPARQL requests to be made against Datasets which are either TDB or
SDAI cache backed.

What was DatasetGraphInMemory created for..? ;-)
Dick

-------- Original message --------
From: "A. Soroka" <aj...@virginia.edu>
Date: 12/02/2016  6:21 pm  (GMT+00:00)
To: users@jena.apache.org
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using

DatasetGraphInMemory


I wrote the DatasetGraphInMemory  code, but I suspect your question

may be better answered by other folks who are more familiar with Jena's
DatasetGraph implementations, or may actually not have anything to do with
DatasetGraph (see below for why). I will try to give some background
information, though.


There are several paths by which where DatasetGraphInMemory can be

performing finds, but they come down to two places in the code, QuadTable::
and TripleTable::find and in default operation, the concrete forms:

https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapQuadTable.java#L100


for Quads and

https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapTripleTable.java#L99


for Triples. Those methods are reused by all the differently-ordered

indexes within Hex- or TriTable, each of which will answer a find by
selecting an appropriately-ordered index based on the fixed and variable
slots in the find pattern and using the concrete methods above to stream
tuples back.


As to why you are seeing your methods called in some places and not

in others, DatasetGraphBaseFind features methods like findInDftGraph(),
findInSpecificNamedGraph(), findInAnyNamedGraphs() etc. and that these are
the methods that DatasetGraphInMemory is implementing. DSGInMemory does not
make a selection between those methods— that is done by
DatasetGraphBaseFind. So that is where you will find the logic that should
answer your question.


Can you say a little more about your use case? You seem to have some

efficient representation in memory of your data (I hope it is in-memory—
otherwise it is a very bad choice to subclass DSGInMemory) and you want to
create tuples on the fly as queries are received. That is really not at all
what DSGInMemory is for (DSGInMemory is using map structures for indexing
and in default mode, uses persistent data structures to support
transactionality). I am wondering whether you might not be much better
served by tapping into Jena at a different place, perhaps implementing the
Graph SPI directly. Or, if reusing DSGInMemory is the right choice, just
implementing Quad- and TripleTable and using the constructor
DatasetGraphInMemory(final QuadTable i, final TripleTable t).


---
A. Soroka
The University of Virginia Library

On Feb 12, 2016, at 12:58 PM, Dick Murray <dandh...@gmail.com>

wrote:


Hi.

Does anyone know the "find" paths through DatasetGraphInMemory

please?


For example if I extend DatasetGraphInMemory and override
DatasetGraphBaseFind.find(node, Node, Node, Node) it breakpoints on

"select

* where {?s ?p ?o}" however if I override the other
DatasetGraphBaseFind.find(...) methods, "select * where {graph ?g

{?s ?p

?o}}" does not trigger a breakpoint i.e. I don't know what method

it's

calling (but as I type I'm guessing it's optimised to return the

HexTable

nodes...).

Would I be better off overriding HexTable and TriTable classes find

methods

when I create the DatasetGraphInMemory? Are all finds guaranteed to

end in

one of these methods?

I need to know the root find methods so that I can shim them to

create

triples/quads before they perform the find.

I need to create Triples/Quads on demand (because a bulk load would

create

~100M triples but only ~1000 are ever queried) and the source binary

form

is more efficient (binary ~1GB native tree versus TDB ~50GB ~100M

quads)

than quads.

Regards Dick Murray.

Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

Reply via email to