Re: SPI DatasetGraph creating Triples/Quads on demand using DatasetGraphInMemory

A. Soroka Thu, 03 Mar 2016 17:36:37 -0800

I’m confused about two of your points here. Let me separate them out so we can 
discuss them easily.


1) "writes are not supported":

Writes are certainly supported in the Graph/DatasetGraph SPI. Graph::add and 
::delete, DatasetGraph::add, ::delete, ::deleteAny… after all, Graph and 
DatasetGraph are the basic abstractions implemented by Jena’s own 
out-of-the-box implementations of RDF storage. Can you explain what you mean by 
this?

2) "methods which call find(ANY, ANY, ANY) play havoc with an on demand triple 
caching algorithm":

The subtypes of TupleTable with which you are working have exactly the same 
kinds of find() methods. Why are they not problematic in that context?

---
A. Soroka
The University of Virginia Library

> On Mar 3, 2016, at 5:47 AM, Joint <dandh...@gmail.com> wrote:
> 
> 
> 
> Hi Andy.
> I implemented the entire SPI at the DatasetGraph and Graph level. It got to 
> the point where I had overridden more methods than not. In addition writes 
> are not supported and contains methods which call find(ANY, ANY, ANY) play 
> havoc with an on demand triple caching algorithm! ;-) I'm using the TriTable 
> because it fits and quads are spoofed via triple to quad iterator.
> I have a set of filters and handles which the find triple is compared against 
> and either passed straight to the TriTable if the triple has been handled 
> before or it's passed to the appropriate handle which adds the triples to the 
> TriTable then calls the find. As the underlying data is a tree a cache depth 
> can be set which allows related triples to be cached. Also the cache can be 
> preloaded with common triples e.g. ANY RDF:type ?.
> Would you consider a generic version for the Jena code base?
> 
> 
> Dick
> 
> -------- Original message --------
> From: Andy Seaborne <a...@apache.org> 
> Date: 18/02/2016  6:31 pm  (GMT+00:00) 
> To: users@jena.apache.org 
> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
>  DatasetGraphInMemory 
> 
> Hi,
> 
> I'm not seeing how tapping into the implementation of 
> DatasetGraphInMemory is going to help (through the details
> 
> As well as the DatasetGraphMap approach, one other thought that occurred 
> to me is to have a dataset (DatasetGraphWrapper) over any DatasetGraph 
> implementation.
> 
> It loads, and clears, the mapped graph on-demand, and passes the find() 
> call through to the now-setup data.
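
A minimal sketch of the wrapper idea above, assuming a plain map stands in for the wrapped DatasetGraph. LazyDatasetWrapper, its loader function, and the String[] triples are hypothetical names used for illustration, not Jena API; a null slot in a pattern plays the role of Node.ANY.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for a wrapper that loads graphs on demand. Not Jena API. */
final class LazyDatasetWrapper {
    // Stands in for the wrapped dataset: graph name -> triples as {s, p, o}.
    private final Map<String, List<String[]>> wrapped = new HashMap<>();
    // Assumed loader that produces the triples of a graph from the backing source.
    private final Function<String, List<String[]>> loader;

    LazyDatasetWrapper(Function<String, List<String[]>> loader) {
        this.loader = loader;
    }

    /** Loads the graph if it is not present, then passes the pattern through. */
    List<String[]> find(String graph, String s, String p, String o) {
        List<String[]> triples = wrapped.computeIfAbsent(graph, loader);
        List<String[]> result = new ArrayList<>();
        if (triples == null) {
            return result; // loader produced nothing for this graph
        }
        for (String[] t : triples) {
            if (matches(s, t[0]) && matches(p, t[1]) && matches(o, t[2])) {
                result.add(t);
            }
        }
        return result;
    }

    /** Drops a loaded graph so that the next find reloads it. */
    void clear(String graph) {
        wrapped.remove(graph);
    }

    // A null pattern slot matches anything.
    private static boolean matches(String pattern, String value) {
        return pattern == null || pattern.equals(value);
    }
}
```

In this sketch, two finds against the same graph name invoke the loader once, and clear forces a reload on the next find.
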
> 
>       Andy
> 
> On 16/02/16 17:42, A. Soroka wrote:
>>> Based on your description the DatasetGraphInMemory would seem to match the 
>>> dynamic load requirement. How did you foresee it being loaded? Is there a 
>>> large overhead to using the add methods?
>> 
>> No, I certainly did not mean to give that impression, and I don’t think it 
>> is entirely accurate. DSGInMemory was definitely not at all meant for 
>> dynamic loading. That doesn’t mean it can’t be used that way, but that was 
>> not in the design, which assumed that all tuples take about the same amount 
>> of time to access and that all of the same type are coming from the same 
>> implementation (in a QuadTable and a TripleTable).
>> 
>> The overhead of mutating a dataset is mostly inside the implementations of 
>> TupleTable that are actually used to store tuples. You should be aware that 
>> TupleTable extends TransactionalComponent, so if you want to use it to 
>> create some kind of connection to your storage, you will need to make that 
>> connection fully transactional. That doesn’t sound at all trivial in your 
>> case.
>> 
>> At this point it seems to me that extending DatasetGraphMap (and 
>> implementing GraphMaker and Graph instead of TupleTable) might be a more 
>> appropriate design for your work. You can put dynamic loading behavior in 
>> Graph (or a GraphView subtype) just as easily as in TupleTable subtypes. Are 
>> there reasons around the use of transactionality in your work that demand 
>> the particular semantics supported by DSGInMemory?
>> 
>> ---
>> A. Soroka
>> The University of Virginia Library
>> 
>>> On Feb 13, 2016, at 5:18 AM, Joint <dandh...@gmail.com> wrote:
>>> 
>>> 
>>> 
>>> Hi.
>>> The quick full scenario is a distributed DaaS which supports queries, 
>>> updates, transforms and bulkloads. Andy Seaborne knows some of the detail 
>>> because I spoke to him previously. We achieve multiple writes by having 
>>> parallel Datasets, both traditional TDB and on demand in memory. Writes are 
>>> sent to a free dataset, free being not in a write transaction. That's a 
>>> simplistic overview...
>>> Queries are handled by a dataset proxy which builds a dynamic dataset based 
>>> on the graph URIs. For example the graph URI urn:Iungo:all causes the proxy 
>>> find method to issue the query to all known Datasets and return the union 
>>> of results. Various dataset proxies exist, some load TDBs, others load TTL 
>>> files into graphs, others dynamically create tuples. The common thing being 
>>> they are all presented as Datasets backed by DatasetGraph. Thus a SPARQL 
>>> query can result in multiple Datasets being loaded to satisfy the query.
>>> Nodes can be preloaded which then load Datasets to satisfy finds. This way 
>>> the system can be scaled to handle increased workloads. Also specific 
>>> nodes can be targeted to specific hardware.
>>> When a graph URI is encountered the proxy can interpret its structure. So 
>>> urn:Iungo:sdai/foo/bar would cause the SDAI model bar in the SDAI 
>>> repository foo to be dynamically loaded into memory along with the quads 
>>> which are required to satisfy the find.
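
A minimal sketch of how such a graph URI could be split into repository and model, assuming the layout urn:Iungo:sdai/<repository>/<model> implied by the example above. SdaiGraphRef and its parse method are hypothetical names, not part of the system described.

```java
import java.util.Optional;

/** Hypothetical holder for the parts of a graph URI such as urn:Iungo:sdai/foo/bar. */
final class SdaiGraphRef {
    // Assumed URI layout: urn:Iungo:sdai/<repository>/<model>
    private static final String PREFIX = "urn:Iungo:sdai/";

    final String repository;
    final String model;

    SdaiGraphRef(String repository, String model) {
        this.repository = repository;
        this.model = model;
    }

    /** Returns the repository/model pair, or empty if the URI does not follow the assumed layout. */
    static Optional<SdaiGraphRef> parse(String graphUri) {
        if (graphUri == null || !graphUri.startsWith(PREFIX)) {
            return Optional.empty();
        }
        String[] parts = graphUri.substring(PREFIX.length()).split("/");
        if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new SdaiGraphRef(parts[0], parts[1]));
    }
}
```

A URI such as urn:Iungo:all does not match the assumed prefix, so a proxy would route it elsewhere, for example to the union over all known Datasets.
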
>>> Typically a group of people will be working on a set of data so the first 
>>> to query will load the dataset then it will be accessed multiple times. 
>>> There will be an initial dynamic load of data which will tail off with some 
>>> additional loading over time.
>>> Based on your description the DatasetGraphInMemory would seem to match the 
>>> dynamic load requirement. How did you foresee it being loaded? Is there a 
>>> large overhead to using the add methods?
>>> A typical scenario would be to search all SDAI repositories for some key 
>>> information then load detailed information in some, continuing to drill 
>>> down.
>>> Hope this helps.
>>> I'm going to extend the hex and tri tables and run some tests. I've already 
>>> shimmed the DGTriplesQuads so the actual caching code already exists and 
>>> should be easy to hook on.
>>> Dick
>>> 
>>> -------- Original message --------
>>> From: "A. Soroka" <aj...@virginia.edu>
>>> Date: 12/02/2016  11:07 pm  (GMT+00:00)
>>> To: users@jena.apache.org
>>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using 
>>> DatasetGraphInMemory
>>> 
>>> Okay, I’m more confident at this point that you’re not well served by 
>>> DatasetGraphInMemory, which has very strong assumptions about the speedy 
>>> reachability of data. DSGInMemory was built for situations when all of the 
>>> data is in core memory and multithreaded access is important. If you have a 
>>> lot of core memory and can load the data fully, you might want to use it, 
>>> but that doesn’t sound at all like your case. Otherwise, as far as what the 
>>> right extension point is, I will need to defer to committers or more 
>>> experienced devs, but I think you may need to look at DatasetGraph from a 
>>> more close-to-the-metal point. TDB extends DatasetGraphTriplesQuads 
>>> directly, for example.
>>> 
>>> Can you tell us a bit more about your full scenario? I don’t know much 
>>> about STEP (sorry if others do)— is there a canonical RDF formulation? What 
>>> kinds of queries are you going to be using with this data? How quickly are 
>>> users going to need to switch contexts between datasets?
>>> 
>>> ---
>>> A. Soroka
>>> The University of Virginia Library
>>> 
>>>> On Feb 12, 2016, at 2:44 PM, Joint <dandh...@gmail.com> wrote:
>>>> 
>>>> 
>>>> 
>>>> Thanks for the fast response!
>>>> I have a set of disk-based binary SDAI repositories which are based on 
>>>> ISO10303 parts 11/21/25/27 otherwise known as the EXPRESS/STEP/SDAI parts. 
>>>> In particular my files are IFC2x3 files which can be +1Gb. However after 
>>>> processing into a SDAI binary I typically see a size reduction e.g. 1.4Gb 
>>>> STEP file becomes a 1Gb SDAI repository. If I convert the STEP file into 
>>>> TDB I get +100M quads and a 50Gb folder. Multiplied by 1000's of similar 
>>>> sized STEP files...
>>>> Typically only a small subset of the STEP file needs to be queried but 
>>>> sometimes other parts need to be queried. Hence the on demand caching and 
>>>> DatasetGraphInMemory. The aim is that in the find methods I check a cache 
>>>> and call the native SDAI find methods based on the node URIs in the case 
>>>> of a cache miss, calling the add methods for the minted tuples, then 
>>>> passing on the call to the super find. The underlying SDAI repositories 
>>>> are static so once a subject is cached no other work is required.
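
The find flow described above might look roughly like the following, assuming string triples and a cache keyed by subject. SimpleTriple, OnDemandTripleStore and nativeLookup are hypothetical names; nativeLookup stands in for the native SDAI find, and a null slot stands in for a wildcard.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Simplified triple of strings, used only for this illustration. */
final class SimpleTriple {
    final String s, p, o;

    SimpleTriple(String s, String p, String o) {
        this.s = s;
        this.p = p;
        this.o = o;
    }
}

/** Mints the triples for a subject on first access, then serves finds from memory. */
final class OnDemandTripleStore {
    private final List<SimpleTriple> table = new ArrayList<>();
    private final Set<String> loadedSubjects = new HashSet<>();
    // Assumed hook that stands in for the native lookup of a subject.
    private final Function<String, List<SimpleTriple>> nativeLookup;
    int nativeCalls = 0; // exposed so the caching behaviour can be observed

    OnDemandTripleStore(Function<String, List<SimpleTriple>> nativeLookup) {
        this.nativeLookup = nativeLookup;
    }

    List<SimpleTriple> find(String s, String p, String o) {
        // A cache miss can only be detected when the subject slot is fixed.
        if (s != null && loadedSubjects.add(s)) {
            nativeCalls++;
            table.addAll(nativeLookup.apply(s)); // add the minted triples
        }
        // Pass the pattern on to the in-memory table.
        List<SimpleTriple> out = new ArrayList<>();
        for (SimpleTriple t : table) {
            if (matches(s, t.s) && matches(p, t.p) && matches(o, t.o)) {
                out.add(t);
            }
        }
        return out;
    }

    // A null pattern slot matches anything.
    private static boolean matches(String pattern, String value) {
        return pattern == null || pattern.equals(value);
    }
}
```

With this shape, a pattern that leaves the subject unbound can only return what earlier finds have already cached.
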
>>>> As the DatasetGraphInMemory is commented as very fast quad and triple 
>>>> access it seemed a logical place to extend. The shim cache would be set to 
>>>> expire entries and limit the total number of tuples per repository. This 
>>>> is currently deployed on a 256Gb ram device.
>>>> In the bigger picture I have a service very similar to Fuseki which allows 
>>>> SPARQL requests to be made against Datasets which are either TDB or SDAI 
>>>> cache backed.
>>>> What was DatasetGraphInMemory created for..? ;-)
>>>> Dick
>>>> 
>>>> -------- Original message --------
>>>> From: "A. Soroka" <aj...@virginia.edu>
>>>> Date: 12/02/2016  6:21 pm  (GMT+00:00)
>>>> To: users@jena.apache.org
>>>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using 
>>>> DatasetGraphInMemory
>>>> 
>>>> I wrote the DatasetGraphInMemory code, but I suspect your question may be 
>>>> better answered by other folks who are more familiar with Jena's 
>>>> DatasetGraph implementations, or may actually not have anything to do with 
>>>> DatasetGraph (see below for why). I will try to give some background 
>>>> information, though.
>>>> 
>>>> There are several paths by which DatasetGraphInMemory can perform finds, 
>>>> but they come down to two places in the code, QuadTable::find and 
>>>> TripleTable::find, and in default operation, the concrete forms:
>>>> 
>>>> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapQuadTable.java#L100
>>>> 
>>>> for Quads and
>>>> 
>>>> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapTripleTable.java#L99
>>>> 
>>>> for Triples. Those methods are reused by all the differently-ordered 
>>>> indexes within Hex- or TriTable, each of which will answer a find by 
>>>> selecting an appropriately-ordered index based on the fixed and variable 
>>>> slots in the find pattern and using the concrete methods above to stream 
>>>> tuples back.
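
As a rough illustration of choosing an index from the fixed slots of a find pattern, the sketch below picks the first ordering whose leading slots are all fixed. The Slot and TripleIndex enums and the three orderings are assumptions made for this example; it is not the code at the links above.

```java
import java.util.Set;

/** The three positions of a triple pattern. */
enum Slot { S, P, O }

/** Hypothetical set of differently-ordered triple indexes. */
enum TripleIndex {
    SPO(Slot.S, Slot.P, Slot.O),
    POS(Slot.P, Slot.O, Slot.S),
    OSP(Slot.O, Slot.S, Slot.P);

    private final Slot[] order;

    TripleIndex(Slot... order) {
        this.order = order;
    }

    /** True if the fixed slots form a leading prefix of this index's ordering. */
    boolean covers(Set<Slot> fixed) {
        for (int i = 0; i < fixed.size(); i++) {
            if (!fixed.contains(order[i])) {
                return false;
            }
        }
        return true;
    }

    /** Picks the first index that can answer the pattern by prefix lookup. */
    static TripleIndex choose(Set<Slot> fixed) {
        for (TripleIndex index : values()) {
            if (index.covers(fixed)) {
                return index;
            }
        }
        // Not reachable with these three orderings; kept as a guard.
        throw new IllegalStateException("no index covers " + fixed);
    }
}
```

A pattern with no fixed slots is covered trivially by the first ordering, which amounts to a full scan of that index.
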
>>>> 
>>>> As to why you are seeing your methods called in some places and not in 
>>>> others, DatasetGraphBaseFind features methods like findInDftGraph(), 
>>>> findInSpecificNamedGraph(), findInAnyNamedGraphs() etc., and these are 
>>>> the methods that DatasetGraphInMemory is implementing. DSGInMemory does 
>>>> not make a selection between those methods— that is done by 
>>>> DatasetGraphBaseFind. So that is where you will find the logic that should 
>>>> answer your question.
>>>> 
>>>> Can you say a little more about your use case? You seem to have some 
>>>> efficient representation in memory of your data (I hope it is in-memory— 
>>>> otherwise it is a very bad choice to subclass DSGInMemory) and you want to 
>>>> create tuples on the fly as queries are received. That is really not at 
>>>> all what DSGInMemory is for (DSGInMemory is using map structures for 
>>>> indexing and in default mode, uses persistent data structures to support 
>>>> transactionality). I am wondering whether you might not be much better 
>>>> served by tapping into Jena at a different place, perhaps implementing the 
>>>> Graph SPI directly. Or, if reusing DSGInMemory is the right choice, just 
>>>> implementing Quad- and TripleTable and using the constructor 
>>>> DatasetGraphInMemory(final QuadTable i, final TripleTable t).
>>>> 
>>>> ---
>>>> A. Soroka
>>>> The University of Virginia Library
>>>> 
>>>>> On Feb 12, 2016, at 12:58 PM, Dick Murray <dandh...@gmail.com> wrote:
>>>>> 
>>>>> Hi.
>>>>> 
>>>>> Does anyone know the "find" paths through DatasetGraphInMemory please?
>>>>> 
>>>>> For example if I extend DatasetGraphInMemory and override
>>>>> DatasetGraphBaseFind.find(Node, Node, Node, Node) it breakpoints on 
>>>>> "select
>>>>> * where {?s ?p ?o}" however if I override the other
>>>>> DatasetGraphBaseFind.find(...) methods, "select * where {graph ?g {?s ?p
>>>>> ?o}}" does not trigger a breakpoint i.e. I don't know what method it's
>>>>> calling (but as I type I'm guessing it's optimised to return the HexTable
>>>>> nodes...).
>>>>> 
>>>>> Would I be better off overriding HexTable and TriTable classes find 
>>>>> methods
>>>>> when I create the DatasetGraphInMemory? Are all finds guaranteed to end in
>>>>> one of these methods?
>>>>> 
>>>>> I need to know the root find methods so that I can shim them to create
>>>>> triples/quads before they perform the find.
>>>>> 
>>>>> I need to create Triples/Quads on demand (because a bulk load would create
>>>>> ~100M triples but only ~1000 are ever queried) and the source binary form
>>>>> is more efficient (binary ~1GB native tree versus TDB ~50GB ~100M quads)
>>>>> than quads.
>>>>> 
>>>>> Regards Dick Murray.
>>>> 
>>> 
>> 
>
