I’m confused about two of your points here. Let me separate them out so we
can discuss them easily.
1) "writes are not supported":
Writes are certainly supported in the Graph/DatasetGraph SPI. Graph::add
and ::delete, DatasetGraph::add, ::delete, ::deleteAny… after all, Graph
and DatasetGraph are the basic abstractions implemented by Jena’s own
out-of-the-box implementations of RDF storage. Can you explain what you
mean by this?
2) "methods which call find(ANY, ANY, ANY) play havoc with an on demand
triple caching algorithm":
The subtypes of TupleTable with which you are working have exactly the
same kinds of find() methods. Why are they not problematic in that context?
---
A. Soroka
The University of Virginia Library
On Mar 3, 2016, at 5:47 AM, Joint <dandh...@gmail.com> wrote:
Hi Andy.
I implemented the entire SPI at the DatasetGraph and Graph level. It got
to the point where I had overridden more methods than not. In addition,
writes are not supported, and contains() methods which call find(ANY, ANY,
ANY) play havoc with an on-demand triple caching algorithm! ;-) I'm using
the TriTable because it fits, and quads are spoofed via a triple-to-quad
iterator.
I have a set of filters and handles against which the find triple is
compared: if the triple has been handled before it is passed straight to
the TriTable, otherwise it's passed to the appropriate handle, which adds
the triples to the TriTable and then calls the find. As the underlying
data is a tree, a cache depth can be set which allows related triples to
be cached. The cache can also be preloaded with common triples, e.g.
ANY rdf:type ?.
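In outline, that dispatch could look something like the following sketch. All the types and names here are stand-ins I've made up to illustrate the idea (a real version would work with Jena Triple patterns and a real TriTable, not strings):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Illustrative sketch only, not the actual code discussed in this thread.
// A find pattern is checked against registered handles; a pattern handled
// before goes straight to the table, otherwise the matching handle mints
// its triples into the table first. preload() warms the table with common
// patterns, e.g. "ANY rdf:type ANY".
public class HandleDispatch {
    private final Map<String, Function<String, List<String>>> handles = new HashMap<>();
    private final Set<String> handled = new HashSet<>();
    private final Map<String, List<String>> table = new HashMap<>(); // TriTable stand-in

    public void register(String patternPrefix, Function<String, List<String>> handle) {
        handles.put(patternPrefix, handle);
    }

    public List<String> find(String pattern) {
        if (handled.add(pattern)) {                      // not seen before: a miss
            for (Map.Entry<String, Function<String, List<String>>> e : handles.entrySet())
                if (pattern.startsWith(e.getKey()))      // matching handle mints triples
                    table.put(pattern, e.getValue().apply(pattern));
        }
        return table.getOrDefault(pattern, List.of());   // answer from the table
    }

    public void preload(String... patterns) {
        for (String p : patterns) find(p);               // warm the cache up front
    }
}
```

The point of the `handled` set is that a repeated pattern never re-invokes its handle, which is what makes a blanket find(ANY, ANY, ANY) from contains() so disruptive to this scheme.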
Would you consider a generic version for the Jena code base?
Dick
-------- Original message --------
From: Andy Seaborne <a...@apache.org>
Date: 18/02/2016 6:31 pm (GMT+00:00)
To: users@jena.apache.org
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
DatasetGraphInMemory
Hi,
I'm not seeing how tapping into the implementation of
DatasetGraphInMemory is going to help (though the details ...)
As well as the DatasetGraphMap approach, one other thought that occurred
to me is to have a dataset (DatasetGraphWrapper) over any DatasetGraph
implementation.
It loads, and clears, the mapped graph on-demand, and passes the find()
call through to the now-setup data.
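A minimal sketch of that wrapper idea, with stand-in types rather than Jena's actual DatasetGraphWrapper API (the loader function and string-based graphs are my simplification):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch only: a wrapper over any DatasetGraph-like store that loads the
// mapped graph the first time it is asked for, passes find() through to
// the now-loaded data, and can clear the graph again to free memory.
public class OnDemandGraphWrapper {
    private final Map<String, List<String>> loaded = new HashMap<>();
    private final Function<String, List<String>> loader; // e.g. parse a TTL file

    public OnDemandGraphWrapper(Function<String, List<String>> loader) {
        this.loader = loader;
    }

    public List<String> find(String graphUri) {
        // Load on first access, then answer from the loaded data.
        return loaded.computeIfAbsent(graphUri, loader);
    }

    public void clear(String graphUri) {
        loaded.remove(graphUri); // next find() reloads on demand
    }
}
```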
Andy
On 16/02/16 17:42, A. Soroka wrote:
Based on your description the DatasetGraphInMemory would seem to match
the dynamic load requirement. How did you foresee it being loaded? Is there
a large overhead to using the add methods?
No, I certainly did not mean to give that impression, and I don’t think
it is entirely accurate. DSGInMemory was definitely not at all meant for
dynamic loading. That doesn’t mean it can’t be used that way, but that was
not in the design, which assumed that all tuples take about the same amount
of time to access and that all of the same type are coming from the same
implementation (in a QuadTable and a TripleTable).
The overhead of mutating a dataset is mostly inside the implementations
of TupleTable that are actually used to store tuples. You should be aware
that TupleTable extends TransactionalComponent, so if you want to use it to
create some kind of connection to your storage, you will need to make that
connection fully transactional. That doesn’t sound at all trivial in your
case.
At this point it seems to me that extending DatasetGraphMap (and
implementing GraphMaker and Graph instead of TupleTable) might be a more
appropriate design for your work. You can put dynamic loading behavior in
Graph (or a GraphView subtype) just as easily as in TupleTable subtypes.
Are there reasons around the use of transactionality in your work that
demand the particular semantics supported by DSGInMemory?
---
A. Soroka
The University of Virginia Library
On Feb 13, 2016, at 5:18 AM, Joint <dandh...@gmail.com> wrote:
Hi.
The quick full scenario is a distributed DaaS which supports queries,
updates, transforms and bulkloads. Andy Seaborne knows some of the detail
because I spoke to him previously. We achieve multiple writes by having
parallel Datasets, both traditional TDB and on-demand in-memory. Writes are
sent to a free dataset, "free" meaning not currently in a write
transaction. That's a simplistic overview...
Queries are handled by a dataset proxy which builds a dynamic dataset
based on the graph URIs. For example the graph URI urn:Iungo:all causes the
proxy find method to issue the query to all known Datasets and return the
union of results. Various dataset proxies exist, some load TDBs, others
load TTL files into graphs, others dynamically create tuples. The common
thing being they are all presented as Datasets backed by DatasetGraph. Thus
a SPARQL query can result in multiple Datasets being loaded to satisfy the
query.
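The fan-out for urn:Iungo:all could be sketched like this, with stand-in types (the proxy class, string patterns and results are all illustrative, not the actual service code):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch only: the pseudo graph URI urn:Iungo:all sends the find to every
// known dataset and returns the union of the results; any other graph URI
// selects a single dataset.
public class DatasetProxy {
    private final Map<String, Function<String, List<String>>> datasets = new LinkedHashMap<>();

    public void add(String name, Function<String, List<String>> find) {
        datasets.put(name, find);
    }

    public List<String> find(String graphUri, String pattern) {
        if ("urn:Iungo:all".equals(graphUri)) {
            List<String> union = new ArrayList<>();
            for (Function<String, List<String>> find : datasets.values())
                union.addAll(find.apply(pattern));   // union of all datasets
            return union;
        }
        Function<String, List<String>> find = datasets.get(graphUri);
        return find == null ? List.of() : find.apply(pattern);
    }
}
```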
Nodes can be preloaded which then load Datasets to satisfy finds. This
way the system can be scaled to handle increased work loads. Also specific
nodes can be targeted to specific hardware.
When a graph URI is encountered the proxy can interpret its
structure. So urn:Iungo:sdai/foo/bar would cause the SDAI model bar in the
SDAI repository foo to be dynamically loaded into memory along with the
quads which are required to satisfy the find.
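Interpreting that URI structure amounts to a small parse; a sketch (the helper name and its return shape are my invention, only the urn:Iungo:sdai/repository/model layout comes from the message above):

```java
// Sketch only: urn:Iungo:sdai/<repository>/<model> names the SDAI model
// that should be dynamically loaded to satisfy a find.
public class SdaiGraphUri {
    private static final String PREFIX = "urn:Iungo:sdai/";

    // Returns { repository, model }, or null if the URI is not an SDAI one.
    public static String[] parse(String graphUri) {
        if (graphUri == null || !graphUri.startsWith(PREFIX))
            return null;
        String[] parts = graphUri.substring(PREFIX.length()).split("/");
        return parts.length == 2 ? parts : null;
    }
}
```

So urn:Iungo:sdai/foo/bar parses to repository "foo" and model "bar", and anything not under the prefix is left for another proxy to handle.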
Typically a group of people will be working on a set of data so the
first to query will load the dataset then it will be accessed multiple
times. There will be an initial dynamic load of data which will tail off
with some additional loading over time.
Based on your description the DatasetGraphInMemory would seem to match
the dynamic load requirement. How did you foresee it being loaded? Is there
a large overhead to using the add methods?
A typical scenario would be to search all SDAI repositories for some
key information then load detailed information in some, continuing to drill
down.
Hope this helps.
I'm going to extend the hex and tri tables and run some tests. I've
already shimmed the DGTriplesQuads so the actual caching code already
exists and should be easy to hook on.
Dick
-------- Original message --------
From: "A. Soroka" <aj...@virginia.edu>
Date: 12/02/2016 11:07 pm (GMT+00:00)
To: users@jena.apache.org
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
DatasetGraphInMemory
Okay, I’m more confident at this point that you’re not well served by
DatasetGraphInMemory, which has very strong assumptions about the speedy
reachability of data. DSGInMemory was built for situations when all of the
data is in core memory and multithreaded access is important. If you have a
lot of core memory and can load the data fully, you might want to use it,
but that doesn’t sound at all like your case. Otherwise, as far as what the
right extension point is, I will need to defer to committers or more
experienced devs, but I think you may need to look at DatasetGraph from a
more close-to-the-metal point of view. TDB extends DatasetGraphTriplesQuads
directly, for example.
Can you tell us a bit more about your full scenario? I don’t know much
about STEP (sorry if others do)— is there a canonical RDF formulation? What
kinds of queries are you going to be using with this data? How quickly are
users going to need to switch contexts between datasets?
---
A. Soroka
The University of Virginia Library
On Feb 12, 2016, at 2:44 PM, Joint <dandh...@gmail.com> wrote:
Thanks for the fast response!
I have a set of disk-based binary SDAI repositories which are
based on ISO10303 parts 11/21/25/27 otherwise known as the
EXPRESS/STEP/SDAI parts. In particular my files are IFC2x3 files which can
be +1Gb. However after processing into a SDAI binary I typically see a size
reduction e.g. 1.4Gb STEP file becomes a 1Gb SDAI repository. If I convert
the STEP file into TDB I get +100M quads and a 50Gb folder. Multiplied by
1000's of similar sized STEP files...
Typically only a small subset of the STEP file needs to be queried
but sometimes other parts need to be queried. Hence the on demand caching
and DatasetGraphInMemory. The aim is that in the find methods I check a
cache and call the native SDAI find methods based on the node URIs in the
case of a cache miss, calling the add methods for the minted tuples, then
passing on the call to the super find. The underlying SDAI repositories are
static so once a subject is cached no other work is required.
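That miss-then-delegate flow could be sketched as follows, again with stand-in types (string subjects and a function in place of the native SDAI API, which I don't know):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch only: on a find, the subject is looked up in the cache; on a
// miss, the native SDAI find mints the tuples and they are added, then
// the call falls through to the normal find. Because the underlying
// repositories are static, a cached subject never needs refreshing.
public class SubjectCache {
    private final Map<String, List<String>> cache = new HashMap<>();
    private final Function<String, List<String>> nativeSdaiFind;

    public SubjectCache(Function<String, List<String>> nativeSdaiFind) {
        this.nativeSdaiFind = nativeSdaiFind;
    }

    public List<String> find(String subjectUri) {
        // Miss: mint tuples from the native binary form, once only.
        cache.computeIfAbsent(subjectUri, nativeSdaiFind);
        // Hit (or now-populated): answer from the cache, i.e. the super find.
        return cache.get(subjectUri);
    }
}
```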
As DatasetGraphInMemory is commented as offering very fast quad and triple
access it seemed a logical place to extend. The shim cache would be set to
expire entries and limit the total number of tuples per repository. This
is currently deployed on a 256GB RAM device.
In the bigger picture l have a service very similar to Fuseki which
allows SPARQL requests to be made against Datasets which are either TDB or
SDAI cache backed.
What was DatasetGraphInMemory created for..? ;-)
Dick
-------- Original message --------
From: "A. Soroka" <aj...@virginia.edu>
Date: 12/02/2016 6:21 pm (GMT+00:00)
To: users@jena.apache.org
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
DatasetGraphInMemory
I wrote the DatasetGraphInMemory code, but I suspect your question
may be better answered by other folks who are more familiar with Jena's
DatasetGraph implementations, or may actually not have anything to do with
DatasetGraph (see below for why). I will try to give some background
information, though.
There are several paths by which DatasetGraphInMemory can be
performing finds, but they come down to two places in the code,
QuadTable::find and TripleTable::find, and in default operation, the
concrete forms:
https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapQuadTable.java#L100
for Quads and
https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapTripleTable.java#L99
for Triples. Those methods are reused by all the differently-ordered
indexes within Hex- or TriTable, each of which will answer a find by
selecting an appropriately-ordered index based on the fixed and variable
slots in the find pattern and using the concrete methods above to stream
tuples back.
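The index-selection idea can be sketched for the triple case like this. To be clear, this is an illustration of the general technique, not the actual TriTable code (real Jena works over Node slots, not booleans):

```java
// Sketch only: given which slots of a find pattern are fixed (concrete
// nodes) versus variable (ANY), pick the ordering whose longest leading
// run of slots is fixed, so the scan range in that index is smallest.
public class TripleIndexChooser {
    static final String[] INDEXES = { "SPO", "POS", "OSP" };

    // Count how many leading slots of an index ordering are fixed.
    static int prefixLen(String index, boolean s, boolean p, boolean o) {
        int n = 0;
        for (char c : index.toCharArray()) {
            boolean fixed = (c == 'S') ? s : (c == 'P') ? p : o;
            if (!fixed) break;
            n++;
        }
        return n;
    }

    // Choose the index with the longest fixed prefix for the pattern.
    public static String choose(boolean s, boolean p, boolean o) {
        String best = INDEXES[0];
        int bestLen = -1;
        for (String idx : INDEXES) {
            int len = prefixLen(idx, s, p, o);
            if (len > bestLen) { bestLen = len; best = idx; }
        }
        return best;
    }
}
```

For example, a pattern fixing only the predicate (find(ANY, p, ANY)) would be answered from the POS ordering; HexTable does the same kind of thing over six quad orderings.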
As to why you are seeing your methods called in some places and not
in others: DatasetGraphBaseFind features methods like findInDftGraph(),
findInSpecificNamedGraph(), findInAnyNamedGraphs() etc., and these are
the methods that DatasetGraphInMemory is implementing. DSGInMemory does not
make a selection between those methods; that is done by
DatasetGraphBaseFind. So that is where you will find the logic that should
answer your question.
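In outline, that selection looks like the following sketch (Node is stood in by String, and the default-graph marker value is my assumption, not something checked against the Jena source):

```java
// Sketch only: the routing DatasetGraphBaseFind performs before any of
// DSGInMemory's findIn* methods run, based solely on the graph slot of
// the find pattern.
public class FindRouter {
    public static final String ANY = "ANY";
    public static final String DEFAULT_GRAPH = "urn:x-arq:DefaultGraph"; // assumed marker

    public static String route(String g) {
        if (g == null || DEFAULT_GRAPH.equals(g))
            return "findInDftGraph";           // default graph only
        if (ANY.equals(g))
            return "findInAnyNamedGraphs";     // union over named graphs
        return "findInSpecificNamedGraph";     // one concrete graph
    }
}
```

This is why overriding find(Node, Node, Node, Node) alone catches "select * where {?s ?p ?o}" but not the graph-pattern query: the two queries are routed to different findIn* methods before your override is ever reached.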
Can you say a little more about your use case? You seem to have some
efficient representation in memory of your data (I hope it is in-memory—
otherwise it is a very bad choice to subclass DSGInMemory) and you want to
create tuples on the fly as queries are received. That is really not at all
what DSGInMemory is for (DSGInMemory is using map structures for indexing
and in default mode, uses persistent data structures to support
transactionality). I am wondering whether you might not be much better
served by tapping into Jena at a different place, perhaps implementing the
Graph SPI directly. Or, if reusing DSGInMemory is the right choice, just
implementing Quad- and TripleTable and using the constructor
DatasetGraphInMemory(final QuadTable i, final TripleTable t).
---
A. Soroka
The University of Virginia Library
On Feb 12, 2016, at 12:58 PM, Dick Murray <dandh...@gmail.com>
wrote:
Hi.
Does anyone know the "find" paths through DatasetGraphInMemory please?

For example, if I extend DatasetGraphInMemory and override
DatasetGraphBaseFind.find(Node, Node, Node, Node) it breakpoints on
"select * where {?s ?p ?o}", but if I override the other
DatasetGraphBaseFind.find(...) methods, "select * where {graph ?g {?s ?p
?o}}" does not trigger a breakpoint, i.e. I don't know what method it's
calling (but as I type I'm guessing it's optimised to return the HexTable
nodes...).

Would I be better off overriding the HexTable and TriTable classes' find
methods when I create the DatasetGraphInMemory? Are all finds guaranteed
to end in one of these methods?

I need to know the root find methods so that I can shim them to create
triples/quads before they perform the find.

I need to create Triples/Quads on demand (because a bulk load would create
~100M triples but only ~1000 are ever queried) and the source binary form
is more efficient (a ~1GB native binary tree versus a ~50GB TDB of ~100M
quads) than quads.

Regards, Dick Murray.