Hi Andy.
I implemented the entire SPI at the DatasetGraph and Graph level. It got to the 
point where I had overridden more methods than not. In addition, writes are not 
supported, and contains() methods which call find(ANY, ANY, ANY) play havoc with 
an on-demand triple caching algorithm! ;-) I'm using the TriTable because it 
fits, and quads are spoofed via a triple-to-quad iterator.
I have a set of filters and handlers against which the find triple is compared: 
the find is passed straight to the TriTable if the pattern has been handled 
before, or it is passed to the appropriate handler, which adds the triples to 
the TriTable and then calls the find. As the underlying data is a tree, a cache 
depth can be set which allows related triples to be cached. The cache can also 
be preloaded with common triples, e.g. ANY rdf:type ?.
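Roughly, the shim looks like this. It is only a trimmed-down sketch: the Handler 
interface and the SDAI tree walking are placeholders for my actual code, and the 
TriTable begin/commit lifecycle is elided.

import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Stream;

import org.apache.jena.graph.Node;
import org.apache.jena.graph.Triple;
import org.apache.jena.sparql.core.mem.TriTable;

/** Sketch of the filter/handler shim in front of a TriTable. */
class CachingTripleSource {
    /** A handler mints triples for the patterns it recognises (placeholder interface). */
    interface Handler {
        boolean matches(Node s, Node p, Node o);
        Stream<Triple> mint(Node s, Node p, Node o); // walks the SDAI tree to the configured cache depth
    }

    private final TriTable table = new TriTable();
    private final List<Handler> handlers;
    private final Set<String> seen = ConcurrentHashMap.newKeySet(); // patterns already handled

    CachingTripleSource(List<Handler> handlers) { this.handlers = handlers; }

    // NOTE: TriTable is transactional (TupleTable extends TransactionalComponent);
    // the begin/commit lifecycle around add/find is elided here.
    Stream<Triple> find(Node s, Node p, Node o) {
        String key = s + "|" + p + "|" + o;
        if (seen.add(key)) {
            // First time this pattern is seen: let the matching handler mint the triples.
            handlers.stream()
                    .filter(h -> h.matches(s, p, o))
                    .findFirst()
                    .ifPresent(h -> h.mint(s, p, o).forEach(table::add));
        }
        return table.find(s, p, o); // answer from the cached triples
    }
}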
Would you consider a generic version for the Jena code base?


Dick

-------- Original message --------
From: Andy Seaborne <a...@apache.org> 
Date: 18/02/2016  6:31 pm  (GMT+00:00) 
To: users@jena.apache.org 
Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
  DatasetGraphInMemory 

Hi,

I'm not seeing how tapping into the implementation of 
DatasetGraphInMemory is going to help (though I may be missing some details).

As well as the DatasetGraphMap approach, one other thought that occurred 
to me is to have a dataset (DatasetGraphWrapper) over any DatasetGraph 
implementation.

It loads, and clears, the mapped graph on demand, and passes the find() 
call through to the now-loaded data.
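Something like this, perhaps (just a sketch; loadGraph() is a placeholder for 
whatever actually builds or fetches the graph, and unloading is only noted in a 
comment):

import java.util.Iterator;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.jena.graph.Graph;
import org.apache.jena.graph.Node;
import org.apache.jena.sparql.core.DatasetGraph;
import org.apache.jena.sparql.core.DatasetGraphWrapper;
import org.apache.jena.sparql.core.Quad;

/** Sketch: a wrapper that populates a named graph before delegating find(). */
class OnDemandLoadingWrapper extends DatasetGraphWrapper {
    private final Set<Node> loaded = ConcurrentHashMap.newKeySet();

    OnDemandLoadingWrapper(DatasetGraph base) { super(base); }

    @Override
    public Iterator<Quad> find(Node g, Node s, Node p, Node o) {
        ensureLoaded(g);
        return super.find(g, s, p, o);
    }

    private void ensureLoaded(Node graphName) {
        if (graphName == null || !graphName.isURI() || !loaded.add(graphName))
            return;
        super.addGraph(graphName, loadGraph(graphName)); // push the data into the wrapped dataset
        // Clearing/unloading a graph later would be super.removeGraph(graphName).
    }

    /** Placeholder: build or fetch the graph named by this URI. */
    private Graph loadGraph(Node graphName) {
        throw new UnsupportedOperationException("load from SDAI / TTL / TDB here");
    }
}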

        Andy

On 16/02/16 17:42, A. Soroka wrote:
>> Based on your description the DatasetGraphInMemory would seem to match the 
>> dynamic load requirement. How did you foresee it being loaded? Is there a 
>> large over head to using the add methods?
>
> No, I certainly did not mean to give that impression, and I don’t think it is 
> entirely accurate. DSGInMemory was definitely not at all meant for dynamic 
> loading. That doesn’t mean it can’t be used that way, but that was not in the 
> design, which assumed that all tuples take about the same amount of time to 
> access and that all of the same type are coming from the same implementation 
> (in a QuadTable and a TripleTable).
>
> The overhead of mutating a dataset is mostly inside the implementations of 
> TupleTable that are actually used to store tuples. You should be aware that 
> TupleTable extends TransactionalComponent, so if you want to use it to create 
> some kind of connection to your storage, you will need to make that 
> connection fully transactional. That doesn’t sound at all trivial in your 
> case.
>
> At this point it seems to me that extending DatasetGraphMap (and implementing 
> GraphMaker and Graph instead of TupleTable) might be a more appropriate 
> design for your work. You can put dynamic loading behavior in Graph (or a 
> GraphView subtype) just as easily as in TupleTable subtypes. Are there 
> reasons around the use of transactionality in your work that demand the 
> particular semantics supported by DSGInMemory?
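
For illustration, the Graph side of that suggestion could be as small as the 
following sketch; mintIfNeeded() is a placeholder for the native SDAI lookup, and 
the DatasetGraphMap/GraphMaker wiring that hands such graphs out per graph name 
is elided:

import org.apache.jena.graph.Graph;
import org.apache.jena.graph.Triple;
import org.apache.jena.graph.impl.GraphBase;
import org.apache.jena.sparql.graph.GraphFactory;
import org.apache.jena.util.iterator.ExtendedIterator;

/** Sketch: a Graph that mints triples from the backing store on demand. */
class SdaiBackedGraph extends GraphBase {
    // In-memory cache of the triples minted so far.
    private final Graph cache = GraphFactory.createDefaultGraph();

    @Override
    protected ExtendedIterator<Triple> graphBaseFind(Triple pattern) {
        mintIfNeeded(pattern);       // populate the cache on a miss
        return cache.find(pattern);  // then answer from the cache
    }

    @Override
    public void performAdd(Triple t) { cache.add(t); }

    private void mintIfNeeded(Triple pattern) {
        // Placeholder: call the native store's find for this pattern and
        // add each minted triple to 'cache'.
    }
}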
>
> ---
> A. Soroka
> The University of Virginia Library
>
>> On Feb 13, 2016, at 5:18 AM, Joint <dandh...@gmail.com> wrote:
>>
>>
>>
>> Hi.
>> The quick full scenario is a distributed DaaS which supports queries, 
>> updates, transforms and bulkloads. Andy Seaborne knows some of the detail 
>> because I spoke to him previously. We achieve multiple writes by having 
>> parallel Datasets, both traditional TDB and on-demand in-memory. Writes are 
>> sent to a free dataset, free being one not in a write transaction. That's a 
>> simplistic overview...
>> Queries are handled by a dataset proxy which builds a dynamic dataset based 
>> on the graph URIs. For example the graph URI urn:Iungo:all causes the proxy 
>> find method to issue the query to all known Datasets and return the union of 
>> results. Various dataset proxies exist, some load TDBs, others load TTL 
>> files into graphs, others dynamically create tuples. The common thing being 
>> they are all presented as Datasets backed by DatasetGraph. Thus a SPARQL 
>> query can result in multiple Datasets being loaded to satisfy the query.
>> Nodes can be preloaded, which then load Datasets to satisfy finds. This way 
>> the system can be scaled to handle increased workloads. Also, specific nodes 
>> can be targeted to specific hardware.
>> When a graph URI is encountered the proxy can interpret its structure. So 
>> urn:Iungo:sdai/foo/bar would cause the SDAI model bar in the SDAI repository 
>> foo to be dynamically loaded into memory along with the quads which are 
>> required to satisfy the find.
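
For illustration, the URI interpretation amounts to something like this sketch 
(the actual loading calls are elided):

import org.apache.jena.graph.Node;

/** Sketch: route a graph URI such as urn:Iungo:sdai/foo/bar to its repository/model. */
class GraphUriRouter {
    private static final String ALL = "urn:Iungo:all";
    private static final String SDAI_PREFIX = "urn:Iungo:sdai/";

    void route(Node graphName) {
        String uri = graphName.getURI();
        if (ALL.equals(uri)) {
            // Fan the find out to all known Datasets and union the results (elided).
        } else if (uri.startsWith(SDAI_PREFIX)) {
            String[] parts = uri.substring(SDAI_PREFIX.length()).split("/");
            String repository = parts[0]; // e.g. "foo"
            String model = parts[1];      // e.g. "bar"
            // Load SDAI model 'model' from repository 'repository' into memory,
            // minting the quads needed to satisfy the find (elided).
        }
    }
}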
>> Typically a group of people will be working on a set of data so the first to 
>> query will load the dataset then it will be accessed multiple times. There 
>> will be an initial dynamic load of data which will tail off with some 
>> additional loading over time.
>> Based on your description the DatasetGraphInMemory would seem to match the 
>> dynamic load requirement. How did you foresee it being loaded? Is there a 
>> large over head to using the add methods?
>> A typical scenario would be to search all SDAI repositories for some key 
>> information then load detailed information from some, continuing to drill down.
>> Hope this helps.
>> I'm going to extend the hex and tri tables and run some tests. I've already 
>> shimmed the DGTriplesQuads so the actual caching code already exists and 
>> should be easy to hook in.
>> Dick
>>
>> -------- Original message --------
>> From: "A. Soroka" <aj...@virginia.edu>
>> Date: 12/02/2016  11:07 pm  (GMT+00:00)
>> To: users@jena.apache.org
>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using 
>> DatasetGraphInMemory
>>
>> Okay, I’m more confident at this point that you’re not well served by 
>> DatasetGraphInMemory, which has very strong assumptions about the speedy 
>> reachability of data. DSGInMemory was built for situations when all of the 
>> data is in core memory and multithreaded access is important. If you have a 
>> lot of core memory and can load the data fully, you might want to use it, 
>> but that doesn’t sound at all like your case. Otherwise, as far as what the 
>> right extension point is, I will need to defer to committers or more 
>> experienced devs, but I think you may need to look at DatasetGraph from a 
>> more close-to-the-metal point of view. TDB extends DatasetGraphTriplesQuads 
>> directly, for example.
>>
>> Can you tell us a bit more about your full scenario? I don’t know much about 
>> STEP (sorry if others do)— is there a canonical RDF formulation? What kinds 
>> of queries are you going to be using with this data? How quickly are users 
>> going to need to switch contexts between datasets?
>>
>> ---
>> A. Soroka
>> The University of Virginia Library
>>
>>> On Feb 12, 2016, at 2:44 PM, Joint <dandh...@gmail.com> wrote:
>>>
>>>
>>>
>>> Thanks for the fast response!
>>> I have a set of disk-based binary SDAI repositories which are based on 
>>> ISO 10303 parts 11/21/25/27, otherwise known as the EXPRESS/STEP/SDAI parts. 
>>> In particular my files are IFC2x3 files which can be +1Gb. However, after 
>>> processing into an SDAI binary I typically see a size reduction, e.g. a 1.4Gb 
>>> STEP file becomes a 1Gb SDAI repository. If I convert the STEP file into TDB 
>>> I get +100M quads and a 50Gb folder. Multiplied by 1000s of similar-sized 
>>> STEP files...
>>> Typically only a small subset of the STEP file needs to be queried but 
>>> sometimes other parts need to be queried. Hence the on-demand caching and 
>>> DatasetGraphInMemory. The aim is that in the find methods I check a cache 
>>> and call the native SDAI find methods based on the node URIs in the case 
>>> of a cache miss, calling the add methods for the minted tuples, then 
>>> passing the call on to the super find. The underlying SDAI repositories are 
>>> static, so once a subject is cached no other work is required.
>>> As the DatasetGraphInMemory is commented as very fast for quad and triple 
>>> access it seemed a logical place to extend. The shim cache would be set to 
>>> expire entries and limit the total number of tuples per repository. This 
>>> is currently deployed on a 256Gb RAM device.
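
The per-repository bound could be as simple as the following sketch, assuming a 
plain LRU is enough; evicting a subject's triples from the table is only noted 
in a comment:

import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.jena.graph.Node;

/** Sketch: track which subjects have been minted, bounded per repository (LRU). */
class MintedSubjects extends LinkedHashMap<Node, Boolean> {
    private final int maxSubjects;

    MintedSubjects(int maxSubjects) {
        super(16, 0.75f, true);   // access-order iteration gives LRU behaviour
        this.maxSubjects = maxSubjects;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Node, Boolean> eldest) {
        // Placeholder: on eviction the subject's triples would also be removed from the table.
        return size() > maxSubjects;
    }
}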
>>> In the bigger picture I have a service very similar to Fuseki which allows 
>>> SPARQL requests to be made against Datasets which are either TDB-backed or 
>>> SDAI-cache-backed.
>>> What was DatasetGraphInMemory created for..? ;-)
>>> Dick
>>>
>>> -------- Original message --------
>>> From: "A. Soroka" <aj...@virginia.edu>
>>> Date: 12/02/2016  6:21 pm  (GMT+00:00)
>>> To: users@jena.apache.org
>>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using 
>>> DatasetGraphInMemory
>>>
>>> I wrote the DatasetGraphInMemory code, but I suspect your question may be 
>>> better answered by other folks who are more familiar with Jena's 
>>> DatasetGraph implementations, or may actually not have anything to do with 
>>> DatasetGraph (see below for why). I will try to give some background 
>>> information, though.
>>>
>>> There are several paths by which DatasetGraphInMemory can be 
>>> performing finds, but they come down to two places in the code, QuadTable::find 
>>> and TripleTable::find, and in default operation, the concrete forms:
>>>
>>> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapQuadTable.java#L100
>>>
>>> for Quads and
>>>
>>> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapTripleTable.java#L99
>>>
>>> for Triples. Those methods are reused by all the differently-ordered 
>>> indexes within Hex- or TriTable, each of which will answer a find by 
>>> selecting an appropriately-ordered index based on the fixed and variable 
>>> slots in the find pattern and using the concrete methods above to stream 
>>> tuples back.
>>>
>>> As to why you are seeing your methods called in some places and not in 
>>> others, DatasetGraphBaseFind features methods like findInDftGraph(), 
>>> findInSpecificNamedGraph(), findInAnyNamedGraphs(), etc., and these are 
>>> the methods that DatasetGraphInMemory is implementing. DSGInMemory does not 
>>> make a selection between those methods— that is done by 
>>> DatasetGraphBaseFind. So that is where you will find the logic that should 
>>> answer your question.
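
For illustration, those hook points look roughly like this sketch (the exact 
signatures may differ slightly between Jena versions, and mintAndFind() is a 
placeholder):

import java.util.Iterator;

import org.apache.jena.graph.Node;
import org.apache.jena.sparql.core.DatasetGraphBaseFind;
import org.apache.jena.sparql.core.Quad;

/** Sketch: the three find methods that DatasetGraphBaseFind dispatches to. */
abstract class OnDemandFindBase extends DatasetGraphBaseFind {
    @Override
    protected Iterator<Quad> findInDftGraph(Node s, Node p, Node o) {
        // Reached for patterns against the default graph, e.g. "select * where {?s ?p ?o}".
        return mintAndFind(Quad.defaultGraphIRI, s, p, o);
    }

    @Override
    protected Iterator<Quad> findInSpecificNamedGraph(Node g, Node s, Node p, Node o) {
        // Reached when the graph slot is a concrete name: "graph <g> {?s ?p ?o}".
        return mintAndFind(g, s, p, o);
    }

    @Override
    protected Iterator<Quad> findInAnyNamedGraphs(Node s, Node p, Node o) {
        // Reached when the graph slot is a variable or ANY: "graph ?g {?s ?p ?o}".
        return mintAndFind(Node.ANY, s, p, o);
    }

    /** Placeholder: mint the needed quads, then answer from the cache. */
    protected abstract Iterator<Quad> mintAndFind(Node g, Node s, Node p, Node o);
}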
>>>
>>> Can you say a little more about your use case? You seem to have some 
>>> efficient representation in memory of your data (I hope it is in-memory— 
>>> otherwise it is a very bad choice to subclass DSGInMemory) and you want to 
>>> create tuples on the fly as queries are received. That is really not at all 
>>> what DSGInMemory is for (DSGInMemory is using map structures for indexing 
>>> and in default mode, uses persistent data structures to support 
>>> transactionality). I am wondering whether you might not be much better 
>>> served by tapping into Jena at a different place, perhaps implementing the 
>>> Graph SPI directly. Or, if reusing DSGInMemory is the right choice, just 
>>> implementing Quad- and TripleTable and using the constructor 
>>> DatasetGraphInMemory(final QuadTable i, final TripleTable t).
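
In sketch form, that second option is roughly the following (whether TriTable's 
find is meant to be overridden this way is an assumption to verify; the minting 
call is a placeholder):

import java.util.stream.Stream;

import org.apache.jena.graph.Node;
import org.apache.jena.graph.Triple;
import org.apache.jena.sparql.core.DatasetGraph;
import org.apache.jena.sparql.core.mem.DatasetGraphInMemory;
import org.apache.jena.sparql.core.mem.HexTable;
import org.apache.jena.sparql.core.mem.TriTable;

/** Sketch: a TriTable that mints triples into itself before answering a find. */
class OnDemandTriples extends TriTable {
    /** Wire into a dataset via the constructor mentioned above. */
    static DatasetGraph newDataset() {
        return new DatasetGraphInMemory(new HexTable(), new OnDemandTriples());
    }

    @Override
    public Stream<Triple> find(Node s, Node p, Node o) {
        mintIfNeeded(s, p, o);       // on a miss, add(...) the minted triples first
        return super.find(s, p, o);
    }

    private void mintIfNeeded(Node s, Node p, Node o) {
        // Placeholder: call the native store and add(...) each minted triple.
    }
}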
>>>
>>> ---
>>> A. Soroka
>>> The University of Virginia Library
>>>
>>>> On Feb 12, 2016, at 12:58 PM, Dick Murray <dandh...@gmail.com> wrote:
>>>>
>>>> Hi.
>>>>
>>>> Does anyone know the "find" paths through DatasetGraphInMemory please?
>>>>
>>>> For example if I extend DatasetGraphInMemory and override
>>>> DatasetGraphBaseFind.find(node, Node, Node, Node) it breakpoints on "select
>>>> * where {?s ?p ?o}" however if I override the other
>>>> DatasetGraphBaseFind.find(...) methods, "select * where {graph ?g {?s ?p
>>>> ?o}}" does not trigger a breakpoint i.e. I don't know what method it's
>>>> calling (but as I type I'm guessing it's optimised to return the HexTable
>>>> nodes...).
>>>>
>>>> Would I be better off overriding the HexTable and TriTable find methods
>>>> when I create the DatasetGraphInMemory? Are all finds guaranteed to end in
>>>> one of these methods?
>>>>
>>>> I need to know the root find methods so that I can shim them to create
>>>> triples/quads before they perform the find.
>>>>
>>>> I need to create Triples/Quads on demand (because a bulk load would create
>>>> ~100M triples but only ~1000 are ever queried) and the source binary form
>>>> is more efficient (binary ~1GB native tree versus TDB ~50GB ~100M quads)
>>>> than quads.
>>>>
>>>> Regards Dick Murray.
>>>
>>
>
