Re: Subclass caching has some problems on Fuseki startup

Ryan Stokes Tue, 21 Sep 2021 15:43:12 -0700

Thanks for giving this some more thought, Andy.

We could consider different ways of doing both inference and updating. I
think the basic requirement is that a collection of common medical
datasets (ICD-10, RxNorm, and the like) be treated as a high-performance
ontology, updated at most daily from various sources. Alongside it we could
use a second, writable model: small but changing more frequently, also
very fast to query, and building on the ontology and the simple (RDFS)
inference over it.


Would you recommend a layered configuration so that the ontology model can
remain read-only? I recall documentation on it, but haven't encountered an
example in use, limited as my search has been so far.

As for OWL, I haven't looked closer into the reasoners other than to find
that owl-fb-mini.rules has this (?x ?p ?y) rule in it:

# This one could be omitted since the results are not really very interesting!
#[rdf1and4: (?x ?p ?y) -> (?p rdf:type rdf:Property), (?x rdf:type rdfs:Resource), (?y rdf:type rdfs:Resource)]
[rdf4: (?x ?p ?y) -> (?p rdf:type rdf:Property)]

I'm going to run it without that rule to see if we can blame it. Thanks for
the pointer.
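
To make the suspicion concrete, here is a toy sketch (plain Python, not
Jena's rule engine) of why a rule whose body is the bare pattern (?x ?p ?y)
could be costly: the body matches every triple in the model, including the
triples the rule itself derives. The apply_rdf4 function and the sample
triples are made up for illustration only.

```python
# Toy forward-chaining illustration; this is NOT how Jena evaluates rules.
RDF_TYPE = "rdf:type"
RDF_PROPERTY = "rdf:Property"


def apply_rdf4(triples):
    """Apply [rdf4: (?x ?p ?y) -> (?p rdf:type rdf:Property)] to a fixpoint.

    Returns (closure, matches), where matches counts how many times the
    rule body matched a triple.
    """
    closure = set(triples)
    frontier = set(triples)
    matches = 0
    while frontier:
        derived = set()
        for (_s, p, _o) in frontier:
            matches += 1  # the body (?x ?p ?y) matches every triple
            new = (p, RDF_TYPE, RDF_PROPERTY)
            if new not in closure:
                derived.add(new)
        closure |= derived
        frontier = derived  # derived triples are matched by the rule too
    return closure, matches


# Hypothetical sample data, shaped like a small class hierarchy.
data = {
    ("icd:A00", "rdfs:subClassOf", "icd:A00-A09"),
    ("icd:A00.0", "rdfs:subClassOf", "icd:A00"),
    ("icd:A00", "rdfs:label", "Cholera"),
}

closure, matches = apply_rdf4(data)
print(len(data), len(closure), matches)
```

In this toy run the number of body matches equals the size of the closure,
so the work grows with the whole dataset rather than with the schema.
Whether that is what drives the pump() loop is still to be confirmed by the
experiment above.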

~Ryan

On Sat, Sep 18, 2021 at 11:38 AM Andy Seaborne <[email protected]> wrote:

> Hi Ryan,
>
> On 17/09/2021 16:22, Ryan Stokes wrote:
> > Hi Andy,
> >
> > By way of introduction I've been exploring ontology solutions
> > with Brandon recently using Jena and Fuseki and come to
> > appreciate your capable stewardship and responsive
> > engagement with this community. Thank you.
> >
> > I was able to replicate Brandon's problem loading the ICD-10
> > dataset using any of the built-in OWL reasoners without search
> > indexing. However it did successfully load and respond fast to
> > queries using RDFSRuleReasoner, as well as Transitive and Generic.
>
> OK - we're getting closer.
>
> That "pump" loop could well be the cause if it comes from a rule with
> (?x ?p ?y) in it. Rule 'rdf1and4' - I think the default reasoner for RDFS
> omits that rule. This dataset is only 800K triples.
>
> The rules engine copes with the schema and data changing during runtime
> with an engine that minimises re-computation at the expense of a lot
> more initial work and crucially tracking with in-memory state. I guess
> it is on first-touch doing all the setup work.
>
> [Later: It is not specific to TDB - seems to happen with any base
> storage including both in-memory kinds.]
>
> > Brandon is better able to say whether we need OWL for other
> > reasons, but we do want to use ICD-10-CM with data for inference.
> > Would *Data with RDFS Inferencing* have advantages over using the
> > built-in RDFSRuleReasoner for that?
>
> Maybe :-)
>
> Data+RDFS is different - it's not trying to be a replacement for the
> rules engine for RDFS. We have the rules engine for complete adherence
> to RDFS.
>
> Data+RDFS:
> 1/ It is a fixed RDFS (subclass/subproperty/domain/range).
>     No axioms. No x:directSubClassOf.
> 2/ Applies to every graph in the dataset.
> 3/ Assumes the schema is fixed - no update to the schema at runtime.
> 4/ The schema is invisible - the app sees data and inferred triples.
>
> but it should scale and work with persistent databases.
>
> [ The "no update to the schema" could be changed. Programming needed
> though. ]
>
> So - Ryan, Brandon - what inference does your usage need? Is the
> schema/ontology updated during runtime?
>
>      Andy
>
> >
> > Thanks again for any help in advance,
> >
> > Ryan
> >
> > JFYI, the Transitive- and RDFSRuleReasoners inferred 570k :subClassOf
> > and an additional 192k :type triples over the base 96k of each
> > relation, respectively.
> >
> > Profiling the OWL reasoner with VisualVM I was able to see that it seems
> > to cycle without end through
> >
> > Generator.pump() -> LPInterpreter.next() -> LPInterpreter.run() ->
> > Node.sameValueAs().
> >
> > I have yet to try this on a reduced dataset to see if I can find the
> > minimum necessary to replicate the spin.
> >
> > On Fri, Sep 17, 2021 at 7:04 AM Andy Seaborne <[email protected]> wrote:
> >
> >> Hi Brandon,
> >>
> >> The configuration is quite complex - it's likely due to the inference
> >> layer but it would be worth trying without the text index to confirm
> >> that especially for the loading.
> >>
> >> Do you need all that
> >> <http://jena.hpl.hp.com/2003/OWLMicroFBRuleReasoner>
> >> offers or is all you want RDFS subclass?
> >>
> >> Because there is
> >>     https://jena.apache.org/documentation/rdfs/
> >> (give ICD10CM as both data and also in a file to be the schema).
> >>
> >> The schema is assumed to be fixed which might not work for you long term
> >> but it is another data point to understand the situation.
> >>
> >> About ICD10CM itself - are you wanting to navigate its structure or use
> >> it with data for inference? If it is to navigate its structure do you
> >> even want inference?
> >>
> >>       Andy
> >>
> >> On 14/09/2021 00:42, Brandon Sara wrote:
> >>> I have been able to create an easily reproducible scenario that others
> >>> can use to replicate and test the issues that I’m seeing:
> >>>
> >>> 1. Start Fuseki using the config that I’ve listed below.
> >>> 2. Attempt to load the latest version of ICD-10 CM as provided freely by
> >>>    BioPortal: https://bioportal.bioontology.org/ontologies/ICD10CM
> >>>
> >>> If inference is enabled, then I can’t even get the turtle file to load
> >>> in its entirety. If I load the turtle file without inference, then the
> >>> load completes, but upon restarting the server and submitting a request,
> >>> the service doesn’t finish processing the request in any reasonable
> >>> amount of time, no matter how simple the query of the request is (one
> >>> that actually queries data from the dataset at least).
> >>>
> >>> Config:
> >>>
> >>> PREFIX dcterms: <http://purl.org/dc/terms/>
> >>> PREFIX fuseki: <http://jena.apache.org/fuseki#>
> >>> PREFIX ja: <http://jena.hpl.hp.com/2005/11/Assembler#>
> >>> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> >>> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
> >>> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> >>> PREFIX tdb2: <http://jena.apache.org/2016/tdb#>
> >>> PREFIX text: <http://jena.apache.org/text#>
> >>>
> >>> [] rdf:type fuseki:Server ;
> >>>     fuseki:pingEP true ;
> >>>     fuseki:statsEP true ;
> >>>     fuseki:metricsEP true ;
> >>>     fuseki:compactEP true ;
> >>>
> >>>     ja:context [
> >>>       ja:cxtName "arq:queryTimeout" ;
> >>>       ja:cxtValue "10000,60000" ;
> >>>     ] ;
> >>> .
> >>>
> >>> <#kgService> a fuseki:Service ;
> >>>     fuseki:name "kg" ;
> >>>     fuseki:dataset <#kgIndexedDataset> ;
> >>>     fuseki:endpoint [ fuseki:operation fuseki:query; ] ;
> >>>     fuseki:endpoint [ fuseki:operation fuseki:update; ] ;
> >>>     fuseki:endpoint [ fuseki:operation fuseki:gsp_r; ] ;
> >>>     fuseki:endpoint [ fuseki:operation fuseki:gsp_rw; fuseki:name "data"; ] ;
> >>> .
> >>>
> >>> <#kgIndexedDataset> rdf:type text:TextDataset ;
> >>>     text:dataset <#kgInferredDataset> ;
> >>>     text:index <#kgIndex> ;
> >>> .
> >>>
> >>> <#kgIndex> a text:TextIndexLucene ;
> >>>     text:directory <file:/fuseki/databases/kg.index> ;
> >>>     text:entityMap <#kgEntityMap> ;
> >>>     text:storeValues true ;
> >>>     text:queryParser [ a text:ComplexPhraseQueryParser ]
> >>> .
> >>>
> >>> <#kgEntityMap> a text:EntityMap ;
> >>>     text:defaultField "label" ;
> >>>     text:entityField "uri" ;
> >>>     text:uidField "uid" ;
> >>>     text:langField "lang" ;
> >>>     text:graphField "graph" ;
> >>>     text:map (
> >>>       [ text:field "id" ;
> >>>         text:predicate dcterms:identifier ]
> >>>
> >>>       [ text:field "label" ;
> >>>         text:predicate rdfs:label ]
> >>>     ) ;
> >>> .
> >>>
> >>> <#kgInferredDataset> a ja:RDFDataset ;
> >>>     ja:defaultGraph <#kgInferenceModel> ;
> >>> .
> >>>
> >>> <#kgInferenceModel> a ja:InfModel ;
> >>>     ja:baseModel <#kgTdbGraph> ;
> >>>     ja:reasoner [
> >>>       ja:reasonerURL <http://jena.hpl.hp.com/2003/OWLMicroFBRuleReasoner>
> >>>     ] ;
> >>> .
> >>>
> >>> <#kgTdbGraph> a tdb2:GraphTDB2 ;
> >>>     tdb2:dataset <#kgTdbDataset> ;
> >>> .
> >>>
> >>> <#kgTdbDataset> a tdb2:DatasetTDB2 ;
> >>>     tdb2:location "/fuseki/databases/kg" ;
> >>> .
> >>>
> >>>
> >>>
> >>>
> >>
> >
>
