Re: Subclass caching has some problems on Fuseki startup

2021-09-29 Thread Brandon Sara
> SNOMED has a conversion to OWL - isn't that OWL functional syntax? Or do you 
> have another tool that converts RF2 to RDF?

I used the SNOMED tool to convert to OWL functional syntax, then used robot to 
convert that to turtle

> what OWL features are you going to use?  SNOMED uses more than subclass.

equivalentClass is definitely one we want to use. sameAs is also a possibility 
(though performance may rule that one out), as are many of the property 
features such as inverseOf. At this point in time, other than what is in SNOMED 
(which is honestly pretty complex/impressive), we aren’t explicitly using much 
of OWL, mainly because things just won’t load when we have full text indexing 
in place, even with just the micro profile active. I would hope that we could 
at least use EL (or micro) and the features provided there.

No PHI in Email: PointClickCare and Collective Medical, A PointClickCare 
Company, policies prohibit sending protected health information (PHI) by email, 
which may violate regulatory requirements. If sending PHI is necessary, please 
contact the sender for secure delivery instructions.

Confidentiality Notice: This email message, including any attachments, is for 
the sole use of the intended recipient(s) and may contain confidential and 
privileged information. Any unauthorized review, use, disclosure or 
distribution is prohibited. If you are not the intended recipient, please 
contact the sender by reply email and destroy all copies of the original 
message.


Re: Subclass caching has some problems on Fuseki startup

2021-09-28 Thread Andy Seaborne




On 22/09/2021 17:17, Brandon Sara wrote:
> We will when we start pulling in realtime data that we want the SNOMED 
> inference rules to help us discover new knowledge with.

SNOMED has a conversion to OWL - isn't that OWL functional syntax? Or do 
you have another tool that converts RF2 to RDF?

Also - what OWL features are you going to use?  SNOMED uses more than 
subclass.

Andy





Re: Subclass caching has some problems on Fuseki startup

2021-09-22 Thread Brandon Sara
> Which reasoner? IIRC SnomedCT uses various OWL features
> The default RDFS reasoner does not include the "rdf4" rule which is a 
> whole-dataset rule
> A ruleset tuned to needs may work better.

I tried this to only include subclass and equivalent class using the generic 
reasoner, but the dataset still would not load. Again, even only using the 
transitive reasoner (which I’ve found tends to be the most performant but 
haven’t run actual numbers yet) over snomed wouldn’t load the dataset.

> If you want to navigate the ontology AND apply it to data, then you may need 
> two copies, one without and one with inference. If subclass closure has been 
> applied, you can’t easily see what the immediate parent of a concept is 
> (ontology browsing task)

Yes, we plan on creating a non-inferred Fuseki service so that we can navigate 
(if we aren’t using the transitive reasoner, since it provides a direct 
subclass relationship) and an inferred one for all other queries.

> do you need an inference engine at runtime at all?

We will when we start pulling in realtime data that we want the SNOMED 
inference rules to help us discover new knowledge with.




Re: Subclass caching has some problems on Fuseki startup

2021-09-22 Thread Andy Seaborne
(Obviously SNOMED CT is difficult for me to work with as it is licensed 
and not available directly in RDF, at least last time I looked - you 
have to produce it locally. It was about 5M triples.)


On 22/09/2021 03:09, Brandon Sara wrote:
> We need the inference so that we can know equivalence between classes and
> subclass relationships (eg "type 2 diabetes" is still "diabetes" because it
> is a subclass of diabetes).

Which is only rdfs:subClassOf? (it is in ICD-10 CM)



> Another dataset that I've never been able to get to load with any inference 
> enabled is SNOMED CT.

SNOMED produce a version with the transitive closure already calculated.

> Even when removing all of the owl inference that they have in their dataset and 
> pre-calculating the direct subclass relationships, not even the transitive 
> reasoner will load the dataset


(overlap with discussion with Ryan).

The general OWL reasoners have an axiomatic rule that touches the whole 
dataset. I doubt you need axiomatic inferences.


And if precalculated transitive closure, do you need an inference engine 
at runtime at all?


(riot --rdfs will expand subclass and subproperty ahead of time)


> (without modification at runtime) once the first query is submitted after 
> startup of Fuseki. Granted, it is significantly larger than ICD-10 CM. But 
> still, not being able to load it with even pre-calculated direct subclass 
> relationships is a huge deal breaker. Not to mention the fact that the real 
> power of that dataset comes when the owl inference built into it can actually 
> be used. With it, inference on patient data can reveal potential diagnoses that 
> would not be inferred without OWL reasoning.


Which reasoner? IIRC SnomedCT uses various OWL features.

The default RDFS reasoner does not include the "rdf4" rule which is a 
whole-dataset rule.


A ruleset tuned to needs may work better.

Andy


Re: Subclass caching has some problems on Fuseki startup

2021-09-22 Thread Andy Seaborne




On 21/09/2021 23:42, Ryan Stokes wrote:
> Thanks for giving this some more thought, Andy.
>
> We could consider different ways of doing both inference and updating. I
> think the basic requirements are that a collection of common medical
> datasets (ICD-10, RxNorm, and the like) be treated as a high-performance
> ontology - updated at most daily from various sources. We could use another
> writable model, small but with more frequent changes, which also needs to
> be very fast for queries and build on the ontology and simple (RDFS)
> inference there.
>
> Would you recommend a layered configuration so that the ontology model can
> remain read-only? I recall documentation on it, but haven't encountered an
> example in use, limited as my search has been so far.


If you want to navigate the ontology AND apply it to data, then you may 
need two copies, one without and one with inference. If subclass closure 
has been applied, you can't easily see what the immediate parent of a 
concept is (ontology browsing task)
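
A minimal Python sketch of why the closed copy is awkward for browsing. It uses a made-up three-class chain (the class names are illustrative, not real ICD-10-CM or SNOMED CT identifiers) and assumes the hierarchy has no cycles. Once the closure has been materialised, the immediate parent has to be recovered by discarding every superclass that is reachable through another superclass.

```python
def transitive_closure(direct):
    """Map each class to ALL of its superclasses, given direct parents only."""
    def walk(cls, seen):
        for parent in direct.get(cls, ()):
            if parent not in seen:
                seen.add(parent)
                walk(parent, seen)
        return seen
    return {cls: walk(cls, set()) for cls in direct}

def immediate_parents(closed, cls):
    """Recover direct parents from closed data (assumes an acyclic hierarchy).

    A superclass is dropped when it is also a superclass of another
    superclass of cls, because then it is not the nearest one.
    """
    supers = closed.get(cls, set())
    return {p for p in supers
            if not any(p in closed.get(q, set()) for q in supers if q != p)}

# Hypothetical asserted hierarchy: only direct subclass links.
direct = {
    "Type2Diabetes": {"Diabetes"},
    "Diabetes": {"EndocrineDisease"},
    "EndocrineDisease": set(),
}
closed = transitive_closure(direct)

print(closed["Type2Diabetes"])                       # both ancestors, no ordering
print(immediate_parents(closed, "Type2Diabetes"))    # the nearest ancestor only
```

With the non-inferred copy the browsing query is a single lookup in `direct`; with the closed copy it needs the extra filtering step shown in `immediate_parents`.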



> As for OWL, I haven't looked closer into the reasoners other than to find
> that owl-fb-mini.rules has this (?x ?p ?y) rule in it:
>
> # This one could be omitted since the results are not really very interesting!
> #[rdf1and4: (?x ?p ?y) -> (?p rdf:type rdf:Property), (?x rdf:type rdfs:Resource), (?y rdf:type rdfs:Resource)]
> [rdf4: (?x ?p ?y) -> (?p rdf:type rdf:Property)]
>
> I'm going to run it without that rule to see if we can blame it.

Should be interesting.

> Thanks for the pointer.


Tuning the inference you need for the application is going to help. From 
what I take from these discussions and looking at the data, ICD10CM 
only needs rdfs:subClassOf; it does not mention domain and range, nor 
use OWL specific inference.


Andy



~Ryan

On Sat, Sep 18, 2021 at 11:38 AM Andy Seaborne  wrote:


Hi Ryan,

On 17/09/2021 16:22, Ryan Stokes wrote:

Hi Andy,

By way of introduction I've been exploring ontology solutions
with Brandon recently using Jena and Fuseki and come to
appreciate your capable stewardship and responsive
engagement with this community. Thank you.

I was able to replicate Brandon's problem loading the ICD-10
dataset using any of the built-in OWL reasoners without search
indexing. However it did successfully load and respond fast to
queries using RDFSRuleReasoner, as well as Transitive and Generic.


OK - we're getting closer.

That "pump" loop could well be the cause if it is from a rule with (?x
?p ?y) in it. Rule 'rdf1and4' - I think the default reasoner for RDFS
omits that rule. This dataset is only 800K triples.

The rules engine copes with the schema and data changing during runtime
with an engine that minimises re-computation at the expense of a lot
more initial work and crucially tracking with in-memory state. I guess
it is on first-touch doing all the setup work.

[Later: It is not specific to TDB - seems to happen with any base
storage including both in-memory kinds.]


Brandon is better able to say whether we need OWL for other
reasons, but we do want to use ICD-10-CM with data for inference.
Would *Data with RDFS Inferencing* have advantages over using the
built-in RDFSRuleReasoner for that?


Maybe :-)

Data+RDFS is different - it's not trying to be a replacement for the
rules engine for RDFS. We have the rules engine for complete adherence
to RDFS.

Data+RDFS:
1/ It is a fixed RDFS (subclass/subproperty/domain/range).
 No axioms. No x:directSubClassOf.
2/ Applies to every graph in the dataset.
3/ Assumes the schema is fixed - no update to the schema at runtime.
4/ The schema is invisible - the app sees data and inferred triples.

but it should scale and work with persistent databases.

[ The "no update to the schema" could be changed. Programming needed
though. ]

So - Ryan, Brandon - what inference does your usage need? Is the
schema/ontology updated during runtime?

  Andy



Thanks again for any help in advance,

Ryan

*JFYI, the Transitive- and RDFSRuleReasoners inferred*

*570k :subClassOf and an additional 192k :type triples over the base 96k of
each relation, respectively.*


*Profiling the OWL reasoner with VisualVM I was able to see that it seems
to cycle without end through*


*Generator.pump() -> LPInterpreter.next() -> LPInterpreter.run() ->
Node.sameValueAs(). I have yet to try this on a reduced dataset to see if I
can find the minimum necessary to replicate the spin.*

On Fri, Sep 17, 2021 at 7:04 AM Andy Seaborne  wrote:


Hi Brandon,

The configuration is quite complex - it's likely due to the inference
layer but it would be worth trying without the text index to confirm
that especially for the loading.

Do you need all that

offers or is all you want RDFS subclass?

Because there is
 https://jena.apache.org/documentation/rdfs/
(give ICD10CM as both data and also in a file to be the schema).

The schema is assumed to be fixed which might not work for you long term
but it is another data point to understand 

Re: Subclass caching has some problems on Fuseki startup

2021-09-21 Thread Brandon Sara
We need the inference so that we can know equivalence between classes and 
subclass relationships (eg "type 2 diabetes" is still "diabetes" because it 
is a subclass of diabetes).
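
As a rough illustration of the two inferences named here, the Python sketch below reads owl:equivalentClass as a pair of subclass links in both directions and then follows subclass links transitively. The class names are invented for the example and are not real ICD-10-CM or SNOMED CT identifiers, and this is not how any particular reasoner implements it.

```python
def superclasses(cls, sub_class_of, equivalent):
    """All classes that cls is transitively a subclass of, or equivalent to."""
    # One edge map: "A equivalentClass B" is treated as
    # "A subClassOf B" plus "B subClassOf A".
    edges = {}
    for child, parent in sub_class_of:
        edges.setdefault(child, set()).add(parent)
    for a, b in equivalent:
        edges.setdefault(a, set()).add(b)
        edges.setdefault(b, set()).add(a)

    found, todo = set(), [cls]
    while todo:
        current = todo.pop()
        for parent in edges.get(current, ()):
            if parent not in found:
                found.add(parent)
                todo.append(parent)
    # Design choice for this sketch: do not report cls as its own superclass.
    found.discard(cls)
    return found

# Hypothetical axioms.
sub_class_of = [("Type2Diabetes", "Diabetes"), ("Diabetes", "MetabolicDisease")]
equivalent = [("Diabetes", "DiabetesMellitus")]

result = superclasses("Type2Diabetes", sub_class_of, equivalent)
print(sorted(result))
```

A query for anything typed as "DiabetesMellitus" would then also match data typed as "Type2Diabetes", which is the behaviour described above.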

Another dataset that I've never been able to get to load with any inference 
enabled is SNOMED CT. Even when removing all of the owl inference that they 
have in their dataset and pre-calculating the direct subclass relationships, 
not even the transitive reasoner will load the dataset (without modification at 
runtime) once the first query is submitted after startup of Fuseki. Granted, 
it is significantly larger than ICD-10 CM. But still, not being able to load 
it with even pre-calculated direct subclass relationships is a huge deal 
breaker. Not to mention the fact that the real power of that dataset comes when 
the owl inference built into it can actually be used. With it, inference on 
patient data can reveal potential diagnoses that would not be inferred without 
OWL reasoning.



Re: Subclass caching has some problems on Fuseki startup

2021-09-21 Thread Ryan Stokes
Thanks for giving this some more thought, Andy.

We could consider different ways of doing both inference and updating. I
think the basic requirements are that a collection of common medical
datasets (ICD-10, RxNorm, and the like) be treated as a high-performance
ontology - updated at most daily from various sources. We could use another
writable model, small but with more frequent changes, which also needs to
be very fast for queries and build on the ontology and simple (RDFS)
inference there.

Would you recommend a layered configuration so that the ontology model can
remain read-only? I recall documentation on it, but haven't encountered an
example in use, limited as my search has been so far.

As for OWL, I haven't looked closer into the reasoners other than to find
that owl-fb-mini.rules has this (?x ?p ?y) rule in it:

# This one could be omitted since the results are not really very interesting!
#[rdf1and4: (?x ?p ?y) -> (?p rdf:type rdf:Property), (?x rdf:type rdfs:Resource), (?y rdf:type rdfs:Resource)]
[rdf4: (?x ?p ?y) -> (?p rdf:type rdf:Property)]

I'm going to run it without that rule to see if we can blame it. Thanks for
the pointer.
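
To show why a rule whose body is the bare pattern (?x ?p ?y) is a whole-dataset rule, here is a small Python sketch. It is not how Jena's rule engine works internally, and it does not establish that this rule is the cause of the hang. It only shows that the pattern matches every triple, so the rule body is tried once per triple even though it derives just one fact per distinct predicate. The triples and the `ex:` names are made up.

```python
RDF_TYPE = "rdf:type"
RDF_PROPERTY = "rdf:Property"

def apply_rdf4(triples):
    """One pass of [rdf4: (?x ?p ?y) -> (?p rdf:type rdf:Property)].

    Returns (number of triples the rule body matched, set of derived triples).
    """
    matched = 0
    derived = set()
    for _subject, predicate, _obj in triples:
        matched += 1  # the body has no constants, so every triple matches
        derived.add((predicate, RDF_TYPE, RDF_PROPERTY))
    return matched, derived

# Illustrative data only.
triples = [
    ("ex:E11", "rdfs:subClassOf", "ex:E08-E13"),
    ("ex:E11", "rdfs:label", "Type 2 diabetes mellitus"),
    ("ex:E10", "rdfs:subClassOf", "ex:E08-E13"),
    ("ex:E10", "rdfs:label", "Type 1 diabetes mellitus"),
]
matched, derived = apply_rdf4(triples)
print(matched, len(derived))
```

In a fixpoint evaluation the derived triples match the same pattern again (they use rdf:type as a predicate), so the rule is also re-tried on its own output.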

~Ryan

On Sat, Sep 18, 2021 at 11:38 AM Andy Seaborne  wrote:

> Hi Ryan,
>
> On 17/09/2021 16:22, Ryan Stokes wrote:
> > Hi Andy,
> >
> > By way of introduction I've been exploring ontology solutions
> > with Brandon recently using Jena and Fuseki and come to
> > appreciate your capable stewardship and responsive
> > engagement with this community. Thank you.
> >
> > I was able to replicate Brandon's problem loading the ICD-10
> > dataset using any of the built-in OWL reasoners without search
> > indexing. However it did successfully load and respond fast to
> > queries using RDFSRuleReasoner, as well as Transitive and Generic.
>
> OK - we're getting closer.
>
> That "pump" loop could well be the cause if it is from a rule with (?x
> ?p ?y) in it. Rule 'rdf1and4' - I think the default reasoner for RDFS
> omits that rule. This dataset is only 800K triples.
>
> The rules engine copes with the schema and data changing during runtime
> with an engine that minimises re-computation at the expense of a lot
> more initial work and crucially tracking with in-memory state. I guess
> it is on first-touch doing all the setup work.
>
> [Later: It is not specific to TDB - seems to happen with any base
> storage including both in-memory kinds.]
>
> > Brandon is better able to say whether we need OWL for other
> > reasons, but we do want to use ICD-10-CM with data for inference.
> > Would *Data with RDFS Inferencing* have advantages over using the
> > built-in RDFSRuleReasoner for that?
>
> Maybe :-)
>
> Data+RDFS is different - it's not trying to be a replacement for the
> rules engine for RDFS. We have the rules engine for complete adherence
> to RDFS.
>
> Data+RDFS:
> 1/ It is a fixed RDFS (subclass/subproperty/domain/range).
> No axioms. No x:directSubClassOf.
> 2/ Applies to every graph in the dataset.
> 3/ Assumes the schema is fixed - no update to the schema at runtime.
> 4/ The schema is invisible - the app sees data and inferred triples.
>
> but it should scale and work with persistent databases.
>
> [ The "no update to the schema" could be changed. Programming needed
> though. ]
>
> So - Ryan, Brandon - what inference does your usage need? Is the
> schema/ontology updated during runtime?
>
>  Andy
>
> >
> > Thanks again for any help in advance,
> >
> > Ryan
> >
> > *JFYI, the Transitive- and RDFSRuleReasoners inferred*
> >
> > *570k :subClassOf and an additional 192k :type triples over the base 96k of
> > each relation, respectively.*
> >
> >
> > *Profiling the OWL reasoner with VisualVM I was able to see that it seems
> > to cycle without end through*
> >
> >
> > *Generator.pump() -> LPInterpreter.next() -> LPInterpreter.run() ->
> > Node.sameValueAs(). I have yet to try this on a reduced dataset to see if I
> > can find the minimum necessary to replicate the spin.*
> >
> > On Fri, Sep 17, 2021 at 7:04 AM Andy Seaborne  wrote:
> >
> >> Hi Brandon,
> >>
> >> The configuration is quite complex - it's likely due to the inference
> >> layer but it would be worth trying without the text index to confirm
> >> that especially for the loading.
> >>
> >> Do you need all that
> >> 
> >> offers or is all you want RDFS subclass?
> >>
> >> Because there is
> >> https://jena.apache.org/documentation/rdfs/
> >> (give ICD10CM as both data and also in a file to be the schema).
> >>
> >> The schema is assumed to be fixed which might not work for you long term
> >> but it is another data point to understand the situation.
> >>
> >> About ICD10CM itself - are you wanting to navigate its structure or use
> >> it with data for inference? If it is to navigate its structure do you
> >> even want inference?
> >>
> >>   Andy
> >>
> >> On 14/09/2021 00:42, Brandon Sara wrote:
> >>> I have been able to create an easily 

Re: Subclass caching has some problems on Fuseki startup

2021-09-18 Thread Andy Seaborne

Hi Ryan,

On 17/09/2021 16:22, Ryan Stokes wrote:

Hi Andy,

By way of introduction I've been exploring ontology solutions
with Brandon recently using Jena and Fuseki and come to
appreciate your capable stewardship and responsive
engagement with this community. Thank you.

I was able to replicate Brandon's problem loading the ICD-10
dataset using any of the built-in OWL reasoners without search
indexing. However it did successfully load and respond fast to
queries using RDFSRuleReasoner, as well as Transitive and Generic.


OK - we're getting closer.

That "pump" loop could well be the cause if it is from a rule with (?x 
?p ?y) in it. Rule 'rdf1and4' - I think the default reasoner for RDFS 
omits that rule. This dataset is only 800K triples.


The rules engine copes with the schema and data changing during runtime 
with an engine that minimises re-computation at the expense of a lot 
more initial work and crucially tracking with in-memory state. I guess 
it is on first-touch doing all the setup work.


[Later: It is not specific to TDB - seems to happen with any base 
storage including both in-memory kinds.]



Brandon is better able to say whether we need OWL for other
reasons, but we do want to use ICD-10-CM with data for inference.
Would *Data with RDFS Inferencing* have advantages over using the
built-in RDFSRuleReasoner for that?


Maybe :-)

Data+RDFS is different - it's not trying to be a replacement for the 
rules engine for RDFS. We have the rules engine for complete adherence 
to RDFS.


Data+RDFS:
1/ It is a fixed RDFS (subclass/subproperty/domain/range).
   No axioms. No x:directSubClassOf.
2/ Applies to every graph in the dataset.
3/ Assumes the schema is fixed - no update to the schema at runtime.
4/ The schema is invisible - the app sees data and inferred triples.

but it should scale and work with persistent databases.

[ The "no update to the schema" could be changed. Programming needed 
though. ]
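
The four points above can be pictured with a small Python sketch. It only illustrates the idea (a fixed schema applied to data, with the application seeing data plus inferred triples and never the schema itself). It is not Jena's implementation, and every name and triple in it is made up. To keep it short, the subclass and subproperty maps are assumed to be already transitively closed, and each property has at most one domain and one range.

```python
def infer(data, schema):
    """Return data plus RDFS-style inferred triples under a fixed schema.

    schema holds four maps: 'subClassOf' and 'subPropertyOf'
    (name -> set of all super-names), 'domain' and 'range' (property -> class).
    """
    out = set(data)
    # subPropertyOf: (s p o) and p subPropertyOf q  =>  (s q o)
    for s, p, o in data:
        for q in schema["subPropertyOf"].get(p, ()):
            out.add((s, q, o))
    # domain / range: type the subject / object of each property use
    for s, p, o in list(out):
        if p in schema["domain"]:
            out.add((s, "rdf:type", schema["domain"][p]))
        if p in schema["range"]:
            out.add((o, "rdf:type", schema["range"][p]))
    # subClassOf: (x rdf:type C) and C subClassOf D  =>  (x rdf:type D)
    for s, p, o in list(out):
        if p == "rdf:type":
            for d in schema["subClassOf"].get(o, ()):
                out.add((s, "rdf:type", d))
    return out

# Hypothetical fixed schema, loaded once and not updated at runtime.
schema = {
    "subClassOf": {"ex:Type2Diabetes": {"ex:Diabetes"}},
    "subPropertyOf": {"ex:hasPrimaryDiagnosis": {"ex:hasDiagnosis"}},
    "domain": {"ex:hasDiagnosis": "ex:Patient"},
    "range": {},
}
# Hypothetical instance data.
data = {
    ("ex:patient1", "ex:hasPrimaryDiagnosis", "ex:condition1"),
    ("ex:condition1", "rdf:type", "ex:Type2Diabetes"),
}
inferred = infer(data, schema) - data
for triple in sorted(inferred):
    print(triple)
```

Because the schema is a plain lookup structure rather than part of the data, nothing here needs to track schema changes, which matches point 3.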


So - Ryan, Brandon - what inference does your usage need? Is the 
schema/ontology updated during runtime?


Andy



Thanks again for any help in advance,

Ryan

*JFYI, the Transitive- and RDFSRuleReasoners inferred*

*570k :subClassOf and an additional 192k :type triples over the base 96k of
each relation, respectively.*


*Profiling the OWL reasoner with VisualVM I was able to see that it seems
to cycle without end through*


*Generator.pump() -> LPInterpreter.next() -> LPInterpreter.run() ->
Node.sameValueAs(). I have yet to try this on a reduced dataset to see if I
can find the minimum necessary to replicate the spin.*

On Fri, Sep 17, 2021 at 7:04 AM Andy Seaborne  wrote:


Hi Brandon,

The configuration is quite complex - it's likely due to the inference
layer but it would be worth trying without the text index to confirm
that especially for the loading.

Do you need all that

offers or is all you want RDFS subclass?

Because there is
https://jena.apache.org/documentation/rdfs/
(give ICD10CM as both data and also in a file to be the schema).

The schema is assumed to be fixed which might not work for you long term
but it is another data point to understand the situation.

About ICD10CM itself - are you wanting to navigate its structure or use
it with data for inference? If it is to navigate its structure do you
even want inference?

  Andy

On 14/09/2021 00:42, Brandon Sara wrote:

I have been able to create an easily reproducible scenario that others
can use to replicate and test the issues that I’m seeing:

1. Start fuseki using the config that I’ve listed below.
2. Attempt to load the latest version of ICD-10 CM as provided freely by
BioPortal: https://bioportal.bioontology.org/ontologies/ICD10CM

If inference is enabled, then I can’t even get the turtle file to load
in its entirety. If I load the turtle file without inference, then the load
completes, but upon restarting the server and submitting a request, the
service doesn’t finish processing the request in any reasonable amount of
time, no matter how simple the query of the request is (one that actually
queries data from the dataset at least).


Config:

PREFIX dcterms: 
PREFIX fuseki: 
PREFIX ja: 
PREFIX rdf: 
PREFIX rdfs: 
PREFIX skos: 
PREFIX tdb2: 
PREFIX text: 

[] rdf:type fuseki:Server ;
fuseki:pingEP true ;
fuseki:statsEP true ;
fuseki:metricsEP true ;
fuseki:compactEP true ;

ja:context [
  ja:cxtName "arq:queryTimeout" ;
  ja:cxtValue "1,6" ;
] ;
.

<#kgService> a fuseki:Service ;
fuseki:name "kg" ;
fuseki:dataset <#kgIndexedDataset> ;
fuseki:endpoint [ fuseki:operation fuseki:query; ] ;
fuseki:endpoint [ 

Re: Subclass caching has some problems on Fuseki startup

2021-09-17 Thread Ryan Stokes
Hi Andy,

By way of introduction I've been exploring ontology solutions
with Brandon recently using Jena and Fuseki and come to
appreciate your capable stewardship and responsive
engagement with this community. Thank you.

I was able to replicate Brandon's problem loading the ICD-10
dataset using any of the built-in OWL reasoners without search
indexing. However it did successfully load and respond fast to
queries using RDFSRuleReasoner, as well as Transitive and Generic.

Brandon is better able to say whether we need OWL for other
reasons, but we do want to use ICD-10-CM with data for inference.
Would *Data with RDFS Inferencing* have advantages over using the
built-in RDFSRuleReasoner for that?

Thanks again for any help in advance,

Ryan

*JFYI, the Transitive- and RDFSRuleReasoners inferred*

*570k :subClassOf and an additional 192k :type triples over the base 96k of
each relation, respectively.*


*Profiling the OWL reasoner with VisualVM I was able to see that it seems
to cycle without end through*


*Generator.pump() -> LPInterpreter.next() -> LPInterpreter.run() ->
Node.sameValueAs(). I have yet to try this on a reduced dataset to see if I
can find the minimum necessary to replicate the spin.*

On Fri, Sep 17, 2021 at 7:04 AM Andy Seaborne  wrote:

> Hi Brandon,
>
> The configuration is quite complex - it's likely due to the inference
> layer but it would be worth trying without the text index to confirm
> that especially for the loading.
>
> Do you need all that
> 
> offers or is all you want RDFS subclass?
>
> Because there is
>https://jena.apache.org/documentation/rdfs/
> (give ICD10CM as both data and also in a file to be the schema).
>
> The schema is assumed to be fixed which might not work for you long term
> but it is another data point to understand the situation.
>
> About ICD10CM itself - are you wanting to navigate its structure or use
> it with data for inference? If it is to navigate its structure do you
> even want inference?
>
>  Andy
>
> On 14/09/2021 00:42, Brandon Sara wrote:
> > I have been able to create an easily reproducible scenario that others
> > can use to replicate and test the issues that I’m seeing:
> >
> > 1. Start fuseki using the config that I’ve listed below.
> > 2. Attempt to load the latest version of ICD-10 CM as provided freely by
> > BioPortal: https://bioportal.bioontology.org/ontologies/ICD10CM
> >
> > If inference is enabled, then I can’t even get the turtle file to load
> > in its entirety. If I load the turtle file without inference, then the load
> > completes, but upon restarting the server and submitting a request, the
> > service doesn’t finish processing the request in any reasonable amount of
> > time, no matter how simple the query of the request is (one that actually
> > queries data from the dataset at least).
> >
> > Config:
> >
> > PREFIX dcterms: 
> > PREFIX fuseki: 
> > PREFIX ja: 
> > PREFIX rdf: 
> > PREFIX rdfs: 
> > PREFIX skos: 
> > PREFIX tdb2: 
> > PREFIX text: 
> >
> > [] rdf:type fuseki:Server ;
> >fuseki:pingEP true ;
> >fuseki:statsEP true ;
> >fuseki:metricsEP true ;
> >fuseki:compactEP true ;
> >
> >ja:context [
> >  ja:cxtName "arq:queryTimeout" ;
> >  ja:cxtValue "1,6" ;
> >] ;
> > .
> >
> > <#kgService> a fuseki:Service ;
> >fuseki:name "kg" ;
> >fuseki:dataset <#kgIndexedDataset> ;
> >fuseki:endpoint [ fuseki:operation fuseki:query; ] ;
> >fuseki:endpoint [ fuseki:operation fuseki:update; ] ;
> >fuseki:endpoint [ fuseki:operation fuseki:gsp_r; ] ;
> >fuseki:endpoint [ fuseki:operation fuseki:gsp_rw; fuseki:name "data"; ] ;
> > .
> >
> > <#kgIndexedDataset> rdf:type text:TextDataset ;
> >text:dataset <#kgInferredDataset> ;
> >text:index <#kgIndex> ;
> > .
> >
> > <#kgIndex> a text:TextIndexLucene ;
> >text:directory  ;
> >text:entityMap <#kgEntityMap> ;
> >text:storeValues true ;
> >text:queryParser [ a text:ComplexPhraseQueryParser ]
> > .
> >
> > <#kgEntityMap> a text:EntityMap ;
> >text:defaultField "label" ;
> >text:entityField "uri" ;
> >text:uidField "uid" ;
> >text:langField "lang" ;
> >text:graphField "graph" ;
> >text:map (
> >  [ text:field "id" ;
> >text:predicate dcterms:identifier ]
> >
> >  [ text:field "label" ;
> >text:predicate rdfs:label ]
> >) ;
> > .
> >
> > <#kgInferredDataset> a ja:RDFDataset ;
> >ja:defaultGraph <#kgInferenceModel> ;
> > .
> >
> > <#kgInferenceModel> a ja:InfModel ;
> >ja:baseModel <#kgTdbGraph> ;
> >ja:reasoner [
> >  ja:reasonerURL 
> >] ;
> > .
> 

Re: Subclass caching has some problems on Fuseki startup

2021-09-17 Thread Andy Seaborne

Hi Brandon,

The configuration is quite complex - it's likely due to the inference 
layer but it would be worth trying without the text index to confirm 
that especially for the loading.


Do you need all that

offers or is all you want RDFS subclass?

Because there is
  https://jena.apache.org/documentation/rdfs/
(give ICD10CM as both data and also in a file to be the schema).

The schema is assumed to be fixed which might not work for you long term 
but it is another data point to understand the situation.


About ICD10CM itself - are you wanting to navigate its structure or use 
it with data for inference? If it is to navigate its structure do you 
even want inference?


Andy

On 14/09/2021 00:42, Brandon Sara wrote:

I have been able to create an easily reproducible scenario that others can use 
to replicate and test the issues that I’m seeing:

1. Start fuseki using the config that I’ve listed below.
2. Attempt to load the latest version of ICD-10 CM as provided freely by 
BioPortal: https://bioportal.bioontology.org/ontologies/ICD10CM

If inference is enabled, then I can’t even get the turtle file to load in its 
entirety. If I load the turtle file without inference, then the load completes, 
but upon restarting the server and submitting a request, the service doesn’t 
finish processing the request in any reasonable amount of time, no matter how 
simple the query of the request is (one that actually queries data from the 
dataset at least).

Config:

PREFIX dcterms: 
PREFIX fuseki: 
PREFIX ja: 
PREFIX rdf: 
PREFIX rdfs: 
PREFIX skos: 
PREFIX tdb2: 
PREFIX text: 

[] rdf:type fuseki:Server ;
   fuseki:pingEP true ;
   fuseki:statsEP true ;
   fuseki:metricsEP true ;
   fuseki:compactEP true ;

   ja:context [
     ja:cxtName "arq:queryTimeout" ;
     ja:cxtValue "1,6" ;
   ] ;
.

<#kgService> a fuseki:Service ;
   fuseki:name "kg" ;
   fuseki:dataset <#kgIndexedDataset> ;
   fuseki:endpoint [ fuseki:operation fuseki:query; ] ;
   fuseki:endpoint [ fuseki:operation fuseki:update; ] ;
   fuseki:endpoint [ fuseki:operation fuseki:gsp_r; ] ;
   fuseki:endpoint [ fuseki:operation fuseki:gsp_rw; fuseki:name "data"; ] ;
.

<#kgIndexedDataset> rdf:type text:TextDataset ;
   text:dataset <#kgInferredDataset> ;
   text:index <#kgIndex> ;
.

<#kgIndex> a text:TextIndexLucene ;
   text:directory  ;
   text:entityMap <#kgEntityMap> ;
   text:storeValues true ;
   text:queryParser [ a text:ComplexPhraseQueryParser ]
.

<#kgEntityMap> a text:EntityMap ;
   text:defaultField "label" ;
   text:entityField "uri" ;
   text:uidField "uid" ;
   text:langField "lang" ;
   text:graphField "graph" ;
   text:map (
     [ text:field "id" ;
       text:predicate dcterms:identifier ]

     [ text:field "label" ;
       text:predicate rdfs:label ]
   ) ;
.

<#kgInferredDataset> a ja:RDFDataset ;
   ja:defaultGraph <#kgInferenceModel> ;
.

<#kgInferenceModel> a ja:InfModel ;
   ja:baseModel <#kgTdbGraph> ;
   ja:reasoner [
     ja:reasonerURL 
   ] ;
.

<#kgTdbGraph> a tdb2:GraphTDB2 ;
   tdb2:dataset <#kgTdbDataset> ;
.

<#kgTdbDataset> a tdb2:DatasetTDB2 ;
   tdb2:location "/fuseki/databases/kg" ;
.






Re: Subclass caching has some problems on Fuseki startup

2021-09-13 Thread Brandon Sara
I have been able to create an easily reproducible scenario that others can use 
to replicate and test the issues that I’m seeing:

1. Start fuseki using the config that I’ve listed below.
2. Attempt to load the latest version of ICD-10 CM as provided freely by 
BioPortal: https://bioportal.bioontology.org/ontologies/ICD10CM

If inference is enabled, then I can’t even get the turtle file to load in its 
entirety. If I load the turtle file without inference, then the load completes, 
but upon restarting the server and submitting a request, the service doesn’t 
finish processing the request in any reasonable amount of time, no matter how 
simple the query of the request is (one that actually queries data from the 
dataset at least).

Config:

PREFIX dcterms: 
PREFIX fuseki: 
PREFIX ja: 
PREFIX rdf: 
PREFIX rdfs: 
PREFIX skos: 
PREFIX tdb2: 
PREFIX text: 

[] rdf:type fuseki:Server ;
  fuseki:pingEP true ;
  fuseki:statsEP true ;
  fuseki:metricsEP true ;
  fuseki:compactEP true ;

  ja:context [
    ja:cxtName "arq:queryTimeout" ;
    ja:cxtValue "1,6" ;
  ] ;
.

<#kgService> a fuseki:Service ;
  fuseki:name "kg" ;
  fuseki:dataset <#kgIndexedDataset> ;
  fuseki:endpoint [ fuseki:operation fuseki:query; ] ;
  fuseki:endpoint [ fuseki:operation fuseki:update; ] ;
  fuseki:endpoint [ fuseki:operation fuseki:gsp_r; ] ;
  fuseki:endpoint [ fuseki:operation fuseki:gsp_rw; fuseki:name "data"; ] ;
.

<#kgIndexedDataset> rdf:type text:TextDataset ;
  text:dataset <#kgInferredDataset> ;
  text:index <#kgIndex> ;
.

<#kgIndex> a text:TextIndexLucene ;
  text:directory  ;
  text:entityMap <#kgEntityMap> ;
  text:storeValues true ;
  text:queryParser [ a text:ComplexPhraseQueryParser ]
.

<#kgEntityMap> a text:EntityMap ;
  text:defaultField "label" ;
  text:entityField "uri" ;
  text:uidField "uid" ;
  text:langField "lang" ;
  text:graphField "graph" ;
  text:map (
    [ text:field "id" ;
      text:predicate dcterms:identifier ]

    [ text:field "label" ;
      text:predicate rdfs:label ]
  ) ;
.

<#kgInferredDataset> a ja:RDFDataset ;
  ja:defaultGraph <#kgInferenceModel> ;
.

<#kgInferenceModel> a ja:InfModel ;
  ja:baseModel <#kgTdbGraph> ;
  ja:reasoner [
    ja:reasonerURL 
  ] ;
.

<#kgTdbGraph> a tdb2:GraphTDB2 ;
  tdb2:dataset <#kgTdbDataset> ;
.

<#kgTdbDataset> a tdb2:DatasetTDB2 ;
  tdb2:location "/fuseki/databases/kg" ;
.





Re: Subclass caching has some problems on Fuseki startup

2021-08-30 Thread Lorenz Buehmann



On 27.08.21 22:09, Brandon Sara wrote:

> I’ve finally tracked down the problem (at least at a high level). When using
> the Transitive Reasoner, there is a block of code which caches all subclass
> triples
> (https://github.com/apache/jena/blob/main/jena-core/src/main/java/org/apache/jena/reasoner/transitiveReasoner/TransitiveEngine.java#L316-L326).
> Part of this code searches for all sub-properties of `subClassOf` and begins
> caching triples for those sub-properties. In my situation, I’ve added
> `owl:equivalentClass` manually (since only `TransitiveReasoner` is being used)
> and manually made it a sub-property of `subClassOf`.

In that case you're losing inferences in one direction, aren't you?
Wouldn't it be "cleaner" to resolve the owl:equivalentClass
axioms beforehand by creating subClassOf axioms for both directions,
either by means of a SPARQL query or by wrapping a GenericRuleReasoner
with two rules?
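
The "both directions" idea can be sketched outside Jena as follows. This is only an illustration under stated assumptions: triples are modelled as plain Python tuples, the names `expand_equivalent_classes` and `EXPAND_EQUIVALENT_CLASS` are hypothetical, and the embedded SPARQL 1.1 Update is one possible way to do the same expansion against the default graph of a store, not something taken from this thread.

```python
# Hypothetical illustration: triples are (subject, predicate, object) tuples.
RDFS_SUBCLASS = "rdfs:subClassOf"
OWL_EQUIVALENT = "owl:equivalentClass"

# One possible SPARQL 1.1 Update doing the same expansion inside a store
# (assumption: the axioms live in the default graph).
EXPAND_EQUIVALENT_CLASS = """
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

INSERT { ?a rdfs:subClassOf ?b . ?b rdfs:subClassOf ?a . }
WHERE  { ?a owl:equivalentClass ?b . }
"""


def expand_equivalent_classes(triples):
    """Return the input triples plus a subClassOf triple in each direction
    for every owl:equivalentClass triple."""
    expanded = set(triples)
    for s, p, o in triples:
        if p == OWL_EQUIVALENT:
            expanded.add((s, RDFS_SUBCLASS, o))
            expanded.add((o, RDFS_SUBCLASS, s))
    return expanded


if __name__ == "__main__":
    data = {
        ("ex:A", OWL_EQUIVALENT, "ex:B"),
        ("ex:B", RDFS_SUBCLASS, "ex:C"),
    }
    for triple in sorted(expand_equivalent_classes(data)):
        print(triple)
```

Note that materialising both directions puts a two-node cycle into the subclass graph for every equivalence, which the reasoner then has to handle.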

> The data that I’m uploading right now has a lot of equivalent class triples
> (~>300k). It seems, if I’m understanding the code correctly as I’ve been
> debugging it, that not only is the triple cached…but a traversal of many other
> triples occurs when the caching occurs for even a single triple, is that correct?
> This would explain why (1) it never seems to finish what it is doing and (2) the
> memory grows very, very large while doing it. I ran a single query last night and
> after more than 6 hours, 8 CPUs, and 20GB of RAM, it still never finished loading
> the cache. It seems as though the runtime of this could be exponential in
> nature. My dataset is well over 20 million records (maybe even more, I still
> haven’t gotten a full count yet, but I know for a fact that it is well over 10
> million and believe it to be well more than 20 million). Like I’ve mentioned
> before, there are basically no individuals in the dataset, it’s all ontology
> because it is health care industry coding systems and classifications.
>
> Another strange thing, which I’ve mentioned before, is that I don’t have any of
> these issues when I initially load the data. I can load everything with just 4
> GB of RAM, it loads in a reasonable amount of time, and I can submit queries of
> pretty much any complexity after the upload is complete with no issues, and
> they are very fast too. This only occurs when the server has been restarted and
> the first query that actually pulls something from the dataset (i.e. not an
> empty query) is submitted (no matter how simple or complex that query may be).
>
> Is this a bug or should `owl:equivalentClass` work without my own manual
> specification of it?





Re: Subclass caching has some problems on Fuseki startup

2021-08-29 Thread Dave Reynolds

On 27/08/2021 21:09, Brandon Sara wrote:
> I’ve finally tracked down the problem (at least at a high level). When using
> the Transitive Reasoner, there is a block of code which caches all subclass
> triples
> (https://github.com/apache/jena/blob/main/jena-core/src/main/java/org/apache/jena/reasoner/transitiveReasoner/TransitiveEngine.java#L316-L326).
> Part of this code searches for all sub-properties of `subClassOf` and begins
> caching triples for those sub-properties. In my situation, I’ve added
> `owl:equivalentClass` manually (since only `TransitiveReasoner` is being used)
> and manually made it a sub-property of `subClassOf`. The data that I’m
> uploading right now has a lot of equivalent class triples (~>300k). It seems,
> if I’m understanding the code correctly as I’ve been debugging it, that not
> only is the triple cached…but a traversal of many other triples occurs when
> the caching occurs for even a single triple, is that correct? This would
> explain why (1) it never seems to finish what it is doing and (2) the memory
> grows very, very large while doing it. I ran a single query last night and
> after more than 6 hours, 8 CPUs, and 20GB of RAM, it still never finished
> loading the cache. It seems as though the runtime of this could be
> exponential in nature.


Indeed it can be expensive. The transitive reasoner is doing a 
transitive reduction (finding direct links) not just a transitive 
closure. If I remember correctly this is somewhere between quadratic and 
cubic (something like O(|V|(|V| + |E|)) in the best case). It uses a 
standard but rather old algorithm for this but I think the algorithm is 
still polynomial not exponential. However, (a) there could be some cases 
that throw it off and (b) at 20m records even quadratic would be 
costly, and it's likely to be closer to a power of 2.5.
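
To make the closure/reduction distinction concrete, here is a minimal sketch. It is not the algorithm used in Jena's TransitiveEngine; it is a textbook-style approach that assumes an acyclic graph, and the function names (`reachable`, `transitive_closure`, `transitive_reduction`) are hypothetical. The closure step runs one traversal per node, which is where a |V|(|V| + |E|) term comes from.

```python
def reachable(graph, start):
    """Return the nodes reachable from `start` via one or more edges (iterative DFS)."""
    seen = set()
    stack = list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def transitive_closure(graph):
    """Map every node to everything reachable from it: one DFS per node."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    return {node: reachable(graph, node) for node in nodes}


def transitive_reduction(graph):
    """Keep only direct links: drop an edge u -> v when v is also reachable
    through another successor of u. Assumes the graph is acyclic."""
    closure = transitive_closure(graph)
    direct = {}
    for node, targets in graph.items():
        direct[node] = {
            t for t in targets
            if not any(t in closure[other] for other in targets if other != t)
        }
    return direct


if __name__ == "__main__":
    # A is a subclass of B and C, B is a subclass of C; A -> C is redundant.
    subclass_of = {"A": {"B", "C"}, "B": {"C"}}
    print(transitive_closure(subclass_of))
    print(transitive_reduction(subclass_of))
```

For a sense of scale, and taking 20 million as the node count purely for illustration: (2 x 10^7)^2 = 4 x 10^14, and (2 x 10^7)^2.5 is roughly 1.8 x 10^18.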



> My dataset is well over 20 million records (maybe even more, I still haven’t
> gotten a full count yet, but I know for a fact that it is well over 10 million
> and believe it to be well more than 20 million). Like I’ve mentioned before,
> there are basically no individuals in the dataset, it’s all ontology because it
> is health care industry coding systems and classifications.



> Another strange thing, which I’ve mentioned before, is that I don’t have any of
> these issues when I initially load the data. I can load everything with just 4
> GB of RAM, it loads in a reasonable amount of time, and I can submit queries of
> pretty much any complexity after the upload is complete with no issues, and
> they are very fast too. This only occurs when the server has been restarted and
> the first query that actually pulls something from the dataset (i.e. not an
> empty query) is submitted (no matter how simple or complex that query may be).


Can't explain that.


> Is this a bug or should `owl:equivalentClass` work without my own manual
> specification of it?


Depends what you mean by "work".

I'm afraid I've not been following your earlier posts so I'm not clear 
on what you are trying to achieve.


If you want to deduce new subClassOf (and perhaps equivalentClass) 
relationships from a mix of subClassOf and equivalentClass assertions 
just using the transitive reasoner, with no rule engine, then you would 
have to insert that manual relationship.


If you just want to store and retrieve the equivalentClass relationships 
with no inference then clearly you wouldn't need that extra assertion.


Dave