I’m confused about two of your points here. Let me separate them out so we can discuss them easily.
1) "writes are not supported": Writes are certainly supported in the Graph/DatasetGraph SPI: Graph::add and ::delete, DatasetGraph::add, ::delete, ::deleteAny… After all, Graph and DatasetGraph are the basic abstractions implemented by Jena’s own out-of-the-box implementations of RDF storage. Can you explain what you mean by this?

2) "methods which call find(ANY, ANY, ANY) play havoc with an on demand triple caching algorithm": The subtypes of TupleTable with which you are working have exactly the same kinds of find() methods. Why are they not problematic in that context?

---
A. Soroka
The University of Virginia Library

> On Mar 3, 2016, at 5:47 AM, Joint <dandh...@gmail.com> wrote:
>
> Hi Andy.
> I implemented the entire SPI at the DatasetGraph and Graph level. It got to
> the point where I had overridden more methods than not. In addition, writes
> are not supported and contains methods which call find(ANY, ANY, ANY) play
> havoc with an on demand triple caching algorithm! ;-) I'm using the TriTable
> because it fits and quads are spoofed via a triple to quad iterator.
> I have a set of filters and handles which the find triple is compared against
> and either passed straight to the TriTable if the triple has been handled
> before or it is passed to the appropriate handle which adds the triples to the
> TriTable then calls the find. As the underlying data is a tree, a cache depth
> can be set which allows related triples to be cached. Also the cache can be
> preloaded with common triples e.g. ANY RDF:type ?.
> Would you consider a generic version for the Jena code base?
>
> Dick
>
> -------- Original message --------
> From: Andy Seaborne <a...@apache.org>
> Date: 18/02/2016 6:31 pm (GMT+00:00)
> To: users@jena.apache.org
> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
> DatasetGraphInMemory
>
> Hi,
>
> I'm not seeing how tapping into the implementation of
> DatasetGraphInMemory is going to help (through the details
>
> As well as the DatasetGraphMap approach, one other thought that occurred
> to me is to have a dataset (DatasetGraphWrapper) over any DatasetGraph
> implementation.
>
> It loads, and clears, the mapped graph on-demand, and passes the find()
> call through to the now-setup data.
>
> Andy
>
> On 16/02/16 17:42, A. Soroka wrote:
>>> Based on your description the DatasetGraphInMemory would seem to match the
>>> dynamic load requirement. How did you foresee it being loaded? Is there a
>>> large overhead to using the add methods?
>>
>> No, I certainly did not mean to give that impression, and I don’t think it
>> is entirely accurate. DSGInMemory was definitely not at all meant for
>> dynamic loading. That doesn’t mean it can’t be used that way, but that was
>> not in the design, which assumed that all tuples take about the same amount
>> of time to access and that all of the same type are coming from the same
>> implementation (in a QuadTable and a TripleTable).
>>
>> The overhead of mutating a dataset is mostly inside the implementations of
>> TupleTable that are actually used to store tuples. You should be aware that
>> TupleTable extends TransactionalComponent, so if you want to use it to
>> create some kind of connection to your storage, you will need to make that
>> connection fully transactional. That doesn’t sound at all trivial in your
>> case.
>>
>> At this point it seems to me that extending DatasetGraphMap (and
>> implementing GraphMaker and Graph instead of TupleTable) might be a more
>> appropriate design for your work.
>> You can put dynamic loading behavior in
>> Graph (or a GraphView subtype) just as easily as in TupleTable subtypes. Are
>> there reasons around the use of transactionality in your work that demand
>> the particular semantics supported by DSGInMemory?
>>
>> ---
>> A. Soroka
>> The University of Virginia Library
>>
>>> On Feb 13, 2016, at 5:18 AM, Joint <dandh...@gmail.com> wrote:
>>>
>>> Hi.
>>> The quick full scenario is a distributed DaaS which supports queries,
>>> updates, transforms and bulkloads. Andy Seaborne knows some of the detail
>>> because I spoke to him previously. We achieve multiple writes by having
>>> parallel Datasets, both traditional TDB and on demand in memory. Writes are
>>> sent to a free dataset, free being not in a write transaction. That's a
>>> simplistic overview...
>>> Queries are handled by a dataset proxy which builds a dynamic dataset based
>>> on the graph URIs. For example the graph URI urn:Iungo:all causes the proxy
>>> find method to issue the query to all known Datasets and return the union
>>> of results. Various dataset proxies exist, some load TDBs, others load TTL
>>> files into graphs, others dynamically create tuples. The common thing being
>>> they are all presented as Datasets backed by DatasetGraph. Thus a SPARQL
>>> query can result in multiple Datasets being loaded to satisfy the query.
>>> Nodes can be preloaded which then load Datasets to satisfy finds. This way
>>> the system can be scaled to handle increased workloads. Also specific
>>> nodes can be targeted to specific hardware.
>>> When a graph URI is encountered the proxy can interpret its structure. So
>>> urn:Iungo:sdai/foo/bar would cause the SDAI model bar in the SDAI
>>> repository foo to be dynamically loaded into memory along with the quads
>>> which are required to satisfy the find.
>>> Typically a group of people will be working on a set of data so the first
>>> to query will load the dataset then it will be accessed multiple times.
>>> There will be an initial dynamic load of data which will tail off with some
>>> additional loading over time.
>>> Based on your description the DatasetGraphInMemory would seem to match the
>>> dynamic load requirement. How did you foresee it being loaded? Is there a
>>> large overhead to using the add methods?
>>> A typical scenario would be to search all SDAI repositories for some key
>>> information then load detailed information in some, continuing to drill
>>> down.
>>> Hope this helps.
>>> I'm going to extend the hex and tri tables and run some tests. I've already
>>> shimmed the DGTriplesQuads so the actual caching code already exists and
>>> should be easy to hook on.
>>> Dick
>>>
>>> -------- Original message --------
>>> From: "A. Soroka" <aj...@virginia.edu>
>>> Date: 12/02/2016 11:07 pm (GMT+00:00)
>>> To: users@jena.apache.org
>>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
>>> DatasetGraphInMemory
>>>
>>> Okay, I’m more confident at this point that you’re not well served by
>>> DatasetGraphInMemory, which has very strong assumptions about the speedy
>>> reachability of data. DSGInMemory was built for situations when all of the
>>> data is in core memory and multithreaded access is important. If you have a
>>> lot of core memory and can load the data fully, you might want to use it,
>>> but that doesn’t sound at all like your case. Otherwise, as far as what the
>>> right extension point is, I will need to defer to committers or more
>>> experienced devs, but I think you may need to look at DatasetGraph from a
>>> more close-to-the-metal point. TDB extends DatasetGraphTriplesQuads
>>> directly, for example.
>>>
>>> Can you tell us a bit more about your full scenario? I don’t know much
>>> about STEP (sorry if others do). Is there a canonical RDF formulation? What
>>> kinds of queries are you going to be using with this data? How quickly are
>>> users going to need to switch contexts between datasets?
>>>
>>> ---
>>> A. Soroka
>>> The University of Virginia Library
>>>
>>>> On Feb 12, 2016, at 2:44 PM, Joint <dandh...@gmail.com> wrote:
>>>>
>>>> Thanks for the fast response!
>>>> I have a set of disk based binary SDAI repositories which are based on
>>>> ISO10303 parts 11/21/25/27 otherwise known as the EXPRESS/STEP/SDAI parts.
>>>> In particular my files are IFC2x3 files which can be +1Gb. However after
>>>> processing into a SDAI binary I typically see a size reduction e.g. 1.4Gb
>>>> STEP file becomes a 1Gb SDAI repository. If I convert the STEP file into
>>>> TDB I get +100M quads and a 50Gb folder. Multiplied by 1000's of similar
>>>> sized STEP files...
>>>> Typically only a small subset of the STEP file needs to be queried but
>>>> sometimes other parts need to be queried. Hence the on demand caching and
>>>> DatasetGraphInMemory. The aim is that in the find methods I check a cache
>>>> and call the native SDAI find methods based on the node URIs in the case
>>>> of a cache miss, calling the add methods for the minted tuples, then
>>>> passing on the call to the super find. The underlying SDAI repositories
>>>> are static so once a subject is cached no other work is required.
>>>> As the DatasetGraphInMemory is commented as very fast quad and triple
>>>> access it seemed a logical place to extend. The shim cache would be set to
>>>> expire entries and limit the total number of tuples per repository. This
>>>> is currently deployed on a 256Gb RAM device.
>>>> In the bigger picture I have a service very similar to Fuseki which allows
>>>> SPARQL requests to be made against Datasets which are either TDB or SDAI
>>>> cache backed.
>>>> What was DatasetGraphInMemory created for..? ;-)
>>>> Dick
>>>>
>>>> -------- Original message --------
>>>> From: "A. Soroka" <aj...@virginia.edu>
>>>> Date: 12/02/2016 6:21 pm (GMT+00:00)
>>>> To: users@jena.apache.org
>>>> Subject: Re: SPI DatasetGraph creating Triples/Quads on demand using
>>>> DatasetGraphInMemory
>>>>
>>>> I wrote the DatasetGraphInMemory code, but I suspect your question may be
>>>> better answered by other folks who are more familiar with Jena's
>>>> DatasetGraph implementations, or may actually not have anything to do with
>>>> DatasetGraph (see below for why). I will try to give some background
>>>> information, though.
>>>>
>>>> There are several paths by which DatasetGraphInMemory can be
>>>> performing finds, but they come down to two places in the code,
>>>> QuadTable::find and TripleTable::find and, in default operation, the
>>>> concrete forms:
>>>>
>>>> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapQuadTable.java#L100
>>>>
>>>> for Quads and
>>>>
>>>> https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/core/mem/PMapTripleTable.java#L99
>>>>
>>>> for Triples. Those methods are reused by all the differently-ordered
>>>> indexes within Hex- or TriTable, each of which will answer a find by
>>>> selecting an appropriately-ordered index based on the fixed and variable
>>>> slots in the find pattern and using the concrete methods above to stream
>>>> tuples back.
>>>>
>>>> As to why you are seeing your methods called in some places and not in
>>>> others, DatasetGraphBaseFind features methods like findInDftGraph(),
>>>> findInSpecificNamedGraph(), findInAnyNamedGraphs() etc., and these are
>>>> the methods that DatasetGraphInMemory is implementing. DSGInMemory does
>>>> not make a selection between those methods; that is done by
>>>> DatasetGraphBaseFind. So that is where you will find the logic that should
>>>> answer your question.
>>>>
>>>> Can you say a little more about your use case?
>>>> You seem to have some
>>>> efficient representation in memory of your data (I hope it is in-memory;
>>>> otherwise it is a very bad choice to subclass DSGInMemory) and you want to
>>>> create tuples on the fly as queries are received. That is really not at
>>>> all what DSGInMemory is for (DSGInMemory is using map structures for
>>>> indexing and in default mode, uses persistent data structures to support
>>>> transactionality). I am wondering whether you might not be much better
>>>> served by tapping into Jena at a different place, perhaps implementing the
>>>> Graph SPI directly. Or, if reusing DSGInMemory is the right choice, just
>>>> implementing Quad- and TripleTable and using the constructor
>>>> DatasetGraphInMemory(final QuadTable i, final TripleTable t).
>>>>
>>>> ---
>>>> A. Soroka
>>>> The University of Virginia Library
>>>>
>>>>> On Feb 12, 2016, at 12:58 PM, Dick Murray <dandh...@gmail.com> wrote:
>>>>>
>>>>> Hi.
>>>>>
>>>>> Does anyone know the "find" paths through DatasetGraphInMemory please?
>>>>>
>>>>> For example if I extend DatasetGraphInMemory and override
>>>>> DatasetGraphBaseFind.find(Node, Node, Node, Node) it breakpoints on
>>>>> "select * where {?s ?p ?o}" however if I override the other
>>>>> DatasetGraphBaseFind.find(...) methods, "select * where {graph ?g {?s ?p
>>>>> ?o}}" does not trigger a breakpoint i.e. I don't know what method it's
>>>>> calling (but as I type I'm guessing it's optimised to return the HexTable
>>>>> nodes...).
>>>>>
>>>>> Would I be better off overriding the HexTable and TriTable classes' find
>>>>> methods when I create the DatasetGraphInMemory? Are all finds guaranteed
>>>>> to end in one of these methods?
>>>>>
>>>>> I need to know the root find methods so that I can shim them to create
>>>>> triples/quads before they perform the find.
>>>>>
>>>>> I need to create Triples/Quads on demand (because a bulk load would create
>>>>> ~100M triples but only ~1000 are ever queried) and the source binary form
>>>>> is more efficient (binary ~1GB native tree versus TDB ~50GB ~100M quads)
>>>>> than quads.
>>>>>
>>>>> Regards Dick Murray.
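
To make point 2 above concrete, here is a minimal sketch of why a find with every slot open defeats an on-demand cache. It is not Jena code and not anyone's actual implementation: plain strings stand in for Nodes, ANY stands in for Node.ANY, a HashSet stands in for the TriTable, and the class name OnDemandTripleCache, the loader function and the scanning helper are all hypothetical names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Toy on-demand triple cache. A triple is a 3-element list of strings. */
final class OnDemandTripleCache {
    /** Wildcard marker (assumption: plays the role of Jena's Node.ANY). */
    static final String ANY = "*";

    private final Set<List<String>> table = new HashSet<>();   // stand-in for the TriTable
    private final Set<List<String>> handled = new HashSet<>(); // patterns already materialised
    private final Function<List<String>, List<List<String>>> loader; // hypothetical backing-store lookup
    private int loaderCalls = 0;

    OnDemandTripleCache(Function<List<String>, List<List<String>>> loader) {
        this.loader = loader;
    }

    /** True if every fixed slot of the pattern equals the triple's slot. */
    static boolean matches(List<String> pattern, List<String> triple) {
        for (int i = 0; i < 3; i++) {
            String want = pattern.get(i);
            if (!ANY.equals(want) && !want.equals(triple.get(i))) {
                return false;
            }
        }
        return true;
    }

    /** Hypothetical loader that scans a fixed source list for matching triples. */
    static Function<List<String>, List<List<String>>> scanning(List<List<String>> source) {
        return pattern -> {
            List<List<String>> hits = new ArrayList<>();
            for (List<String> t : source) {
                if (matches(pattern, t)) {
                    hits.add(t);
                }
            }
            return hits;
        };
    }

    /** On a pattern not seen before, load matching triples first, then answer from the table. */
    List<List<String>> find(String s, String p, String o) {
        List<String> pattern = List.of(s, p, o);
        if (handled.add(pattern)) {
            loaderCalls++;
            table.addAll(loader.apply(pattern));
        }
        List<List<String>> result = new ArrayList<>();
        for (List<String> t : table) {
            if (matches(pattern, t)) {
                result.add(t);
            }
        }
        return result;
    }

    int loaderCalls() { return loaderCalls; }

    int cachedTriples() { return table.size(); }
}
```

With a three-triple source, find("a", "height", ANY) pulls a single triple into the table, and repeating the same pattern is answered without calling the loader again. find(ANY, ANY, ANY) pulls in all three. In this toy model at least, anything that bottoms out in a fully open find materialises the whole source, whichever layer the cache sits in.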