Re: Persistent Model Implementation

2018-10-23 Thread Daan Reid

Hi,

It may not fit your use case precisely, but we've had some success 
combining the event-sourcing pattern with caching to create datasets with 
history. Using Jena's deltas lets us persist only the changesets.


This may also be of general interest to the list.

https://github.com/drugis/jena-es
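
As a rough illustration of the idea (persist only changesets, rebuild and cache any version on demand), here is a minimal plain-Java sketch. It is not the jena-es API and uses no Jena classes; `Triple`, `ChangeSet` and `EventSourcedStore` are hypothetical names made up for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.ArrayList;
import java.util.Map;
import java.util.Set;

/** A triple reduced to three strings (hypothetical stand-in, not a Jena type). */
record Triple(String s, String p, String o) {}

/** One event: the triples added and removed by a single write. */
record ChangeSet(Set<Triple> added, Set<Triple> removed) {}

/** Append-only log of changesets; versions are rebuilt by replay and cached. */
final class EventSourcedStore {
    private final List<ChangeSet> log = new ArrayList<>();
    private final Map<Integer, Set<Triple>> cache = new HashMap<>();

    /** Stores only the change, never the full dataset. Returns the new version number. */
    int append(Set<Triple> added, Set<Triple> removed) {
        log.add(new ChangeSet(Set.copyOf(added), Set.copyOf(removed)));
        return log.size();
    }

    /** State as of the given version (0 = empty), served from the cache when possible. */
    Set<Triple> materialize(int version) {
        return cache.computeIfAbsent(version, this::replay);
    }

    /** Replays the first `version` changesets in order. */
    private Set<Triple> replay(int version) {
        Set<Triple> state = new HashSet<>();
        for (ChangeSet cs : log.subList(0, version)) {
            state.removeAll(cs.removed());
            state.addAll(cs.added());
        }
        return Collections.unmodifiableSet(state);
    }
}
```

Because the log is append-only, a cached version never goes stale, which is what makes the caching part cheap to reason about.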

Regards,

Daan Reid





Re: Persistent Model Implementation

2018-10-22 Thread Andy Seaborne

Also:

There is a Union graph - this would be useful if you are not deleting 
triples. It keeps one of the graphs untouched.
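
To make the union idea concrete, here is a minimal plain-Java sketch of a read view over a shared base plus a private overlay that receives all additions. It does not use Jena's union graph; `UnionView` is a hypothetical name, and the sketch assumes the base is not modified by anyone else while views exist.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

/** Read view over two sets of triples; writes only ever touch the overlay. */
final class UnionView<T> {
    private final Set<T> base;                       // shared, never modified here
    private final Set<T> overlay = new HashSet<>();  // private to this view

    UnionView(Set<T> base) {
        this.base = base;
    }

    /** Adds go to the overlay only; triples already in the base are skipped. */
    void add(T triple) {
        if (!base.contains(triple)) {
            overlay.add(triple);
        }
    }

    boolean contains(T triple) {
        return base.contains(triple) || overlay.contains(triple);
    }

    /** The two sets are disjoint by construction, so sizes can simply be summed. */
    int size() {
        return base.size() + overlay.size();
    }

    Stream<T> stream() {
        return Stream.concat(base.stream(), overlay.stream());
    }
}
```

With k such views over one base, the memory cost per view is proportional to what was added to it, not to the size of the base. Deletion from the base is exactly what this shape cannot express.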


Andy





Re: Persistent Model Implementation

2018-10-22 Thread Andy Seaborne

>>> I have an application using Jena where I frequently have to create 
>>> copies of Models in order to then process them individually, i.e. all 
>>> triples of one source Model are added to k new Models which are then 
>>> mutated.


When the lower-level Graph (not Model) is copied in a JVM there is still 
sharing. The RDF terms (Nodes: URIs, blank nodes, literals) are not duplicated.
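
A small plain-Java analogy of that point: copying a collection of triples duplicates the container, not the term objects it refers to. `Term`, `TermTriple` and `CopyDemo` are hypothetical names; Jena's own in-memory graph is organised differently, so treat this only as an illustration of reference sharing.

```java
import java.util.HashSet;
import java.util.Set;

/** Stand-in for an RDF term; deliberately uses identity equality. */
final class Term {
    final String label;

    Term(String label) {
        this.label = label;
    }
}

/** A triple holds three references to terms, not copies of them. */
record TermTriple(Term s, Term p, Term o) {}

final class CopyDemo {
    /** Copies the set structure only; the triples and terms inside are shared. */
    static Set<TermTriple> copy(Set<TermTriple> source) {
        return new HashSet<>(source);
    }
}
```

The copy can be mutated independently of the source, yet both still point at the same term objects, so the per-copy cost is the index structure rather than the terms.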


An RDFNode (really, an EnhNode) is a pair of pointers (graph, node), but it 
is not used in the data structures for the graph, so such objects are 
transient and the GC recycles them.
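
The "pair of pointers" description can be sketched in plain Java as follows. `TinyGraph`, `Edge` and `NodeView` are hypothetical names, not Jena classes; the point is only that the view object is created on demand and never stored by the graph.

```java
import java.util.HashSet;
import java.util.Set;

/** Plain stored triple; the storage knows nothing about views. */
record Edge(String s, String p, String o) {}

final class TinyGraph {
    final Set<Edge> edges = new HashSet<>();

    /** Creates a short-lived view on demand; the graph keeps no reference to it. */
    NodeView view(String node) {
        return new NodeView(this, node);
    }
}

/** Analogue of the (graph, node) pair: two references and nothing else. */
record NodeView(TinyGraph graph, String node) {
    /** Writes go through to whichever graph this view points at. */
    NodeView addProperty(String p, String o) {
        graph.edges.add(new Edge(node, p, o));
        return this;
    }

    /** Re-homes the same node onto another graph: a new pair, no copied data. */
    NodeView inGraph(TinyGraph other) {
        return new NodeView(other, node);
    }
}
```

Seen this way, pointing a resource at a wrapper instead of the base is cheap: it means building a new pair, not copying anything. As far as I recall, Jena's `RDFNode` offers an `inModel` method along these lines, but check the Javadoc before relying on it.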


You can think of Model as a presentation of the basic storage - the Graph.

Andy






Re: Persistent Model Implementation

2018-10-22 Thread Kevin Dreßler

Thanks for your quick reply!

> On 22. Oct 2018, at 12:19, ajs6f  wrote:
> 
> The TIM dataset implementation [1] is backed by persistent data structures 
> (for the confused, the term "persistent" here means in the sense of immutable 
> [2]-- it has nothing to do with disk storage). However, nothing there goes 
> beyond the Node/Triple/Graph/DatasetGraph SPI-- the underlying structures 
> aren't exposed and can't be reused by clients.

This looks interesting but I don't think it actually matches my use case. 
However, I think I would want a transactional commit in my implementation to 
improve performance so that I could collect a set of statements and only create 
a new immutable instance of the model when committing all of these together 
instead of after each single statement.
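
A minimal plain-Java sketch of that commit idea: changes are collected in a transaction object and a new immutable snapshot is produced once, at commit. `Snapshot` and `Txn` are hypothetical names, not Jena API. For brevity the commit copies the whole set; a real implementation on persistent data structures would share structure with the parent instead.

```java
import java.util.HashSet;
import java.util.Set;

/** Immutable snapshot of a set of statements. */
final class Snapshot<T> {
    private final Set<T> statements;

    Snapshot(Set<T> statements) {
        this.statements = Set.copyOf(statements); // unmodifiable copy
    }

    boolean contains(T st) { return statements.contains(st); }

    int size() { return statements.size(); }

    Set<T> statements() { return statements; }

    /** Starts collecting changes against this snapshot. */
    Txn<T> begin() { return new Txn<>(this); }
}

/** Buffers additions and removals; nothing is visible until commit(). */
final class Txn<T> {
    private final Snapshot<T> parent;
    private final Set<T> toAdd = new HashSet<>();
    private final Set<T> toRemove = new HashSet<>();

    Txn(Snapshot<T> parent) { this.parent = parent; }

    Txn<T> add(T st) {
        toRemove.remove(st);
        toAdd.add(st);
        return this;
    }

    Txn<T> remove(T st) {
        toAdd.remove(st);
        toRemove.add(st);
        return this;
    }

    /** Builds exactly one new snapshot for the whole batch; the parent is unchanged. */
    Snapshot<T> commit() {
        Set<T> next = new HashSet<>(parent.statements());
        next.removeAll(toRemove);
        next.addAll(toAdd);
        return new Snapshot<>(next);
    }
}
```

Usage would look like `Snapshot<String> v1 = v0.begin().add("c").remove("a").commit();`, after which `v0` still holds its original statements.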

> This sounds like an interesting and powerful use case, although I'm not sure 
> how easily it could be accomplished within the current API. For one thing, we 
> don't have a good way of distinguishing mutable and immutable models in 
> Jena's type system right now.
> 
> Are the "k new Models" both adding and removing triples? If they're just 
> adding triples, perhaps a clever wrapper might work.

Both addition and deletion of triples are possible. But the wrapper idea is nice 
and might actually work for both addition and deletion, as I could try to cache 
a set of Statements that have been deleted as long as this cache's size is under 
x% of the base model's size.
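
A minimal plain-Java sketch of such a wrapper: a shared base plus private sets of added and deleted statements, falling back to a private copy once the deleted set exceeds a fraction of the base size. `Overlay`, `maxDeletedFraction` and `ownsBase` are hypothetical names chosen for this example; Jena's graph.compose package has, as far as I know, a Delta graph built on a similar additions/deletions idea, but this sketch does not use it.

```java
import java.util.HashSet;
import java.util.Set;

/** Wrapper over a shared base that records additions and deletions separately. */
final class Overlay<T> {
    private Set<T> base;                             // shared until compaction
    private final Set<T> added = new HashSet<>();    // always disjoint from base
    private final Set<T> deleted = new HashSet<>();  // always a subset of base
    private final double maxDeletedFraction;         // the "x%" threshold, e.g. 0.05
    private boolean ownsBase = false;

    Overlay(Set<T> base, double maxDeletedFraction) {
        this.base = base;
        this.maxDeletedFraction = maxDeletedFraction;
    }

    boolean contains(T st) {
        return added.contains(st) || (base.contains(st) && !deleted.contains(st));
    }

    void add(T st) {
        if (base.contains(st)) {
            deleted.remove(st);   // undo an earlier deletion, if any
        } else {
            added.add(st);
        }
    }

    void remove(T st) {
        if (added.remove(st)) {
            return;               // it only ever existed in this overlay
        }
        if (base.contains(st)) {
            deleted.add(st);
            compactIfNeeded();
        }
    }

    int size() {
        return base.size() - deleted.size() + added.size();
    }

    /** True once this overlay has stopped sharing the original base. */
    boolean ownsBase() {
        return ownsBase;
    }

    /** Gives up sharing once the deleted set grows past the threshold. */
    private void compactIfNeeded() {
        if (deleted.size() <= maxDeletedFraction * base.size()) {
            return;
        }
        Set<T> own = new HashSet<>(base);
        own.removeAll(deleted);
        own.addAll(added);
        base = own;
        added.clear();
        deleted.clear();
        ownsBase = true;
    }
}
```

The original base is never written to, so any number of overlays can share it safely as long as nothing else mutates it.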

> Otherwise, have you tried using an intermediating caching setup, wherein 
> statements that are copied are routed through a cache that prevents 
> duplication? I believe Andy deployed a similar technique for some of the TDB 
> loading code and saw great improvement therefrom.

I just started researching this so I haven't done anything in this direction. 
Do you believe the wrapper / caching approach would be feasible with the 
current API? I am not very familiar with Jena's implementations but from my 
experience with the API it seems that every RDFNode has a reference to the 
model from which it was retrieved (if any). So in order to not violate API 
contracts I think I would also need to wrap each resource upon retrieval to 
point to the wrapper model instead of the base model?

> ajs6f
> 
> [1] https://jena.apache.org/documentation/rdf/datasets.html
> [2] https://en.wikipedia.org/wiki/Persistent_data_structure
> 
>> On Oct 22, 2018, at 12:08 PM, Kevin Dreßler  wrote:
>> 
>> Hello everyone,
>> 
>> I have an application using Jena where I frequently have to create copies of 
>> Models in order to then process them individually, i.e. all triples of one 
>> source Model are added to k new Models which are then mutated.
>> 
>> For larger Models this obviously takes some time and, more relevant for me, 
>> creates a considerable amount of memory pressure.
>> However, with a Model implementation based on persistent data structures I 
>> could eliminate most of these issues as the amount of data changed is 
>> typically under 5% compared to the overall Model size.
>> 
>> Has anyone ever done something like this before, i.e. are there immutable 
>> Model implementations with structural sharing that someone is aware of? If 
>> not what would be your advice on how one would approach implementing this in 
>> their own code base?
>> 
>> Best regards,
>> Kevin



Re: Persistent Model Implementation

2018-10-22 Thread ajs6f

The TIM dataset implementation [1] is backed by persistent data structures (for 
the confused, the term "persistent" here means in the sense of immutable [2]-- 
it has nothing to do with disk storage). However, nothing there goes beyond the 
Node/Triple/Graph/DatasetGraph SPI-- the underlying structures aren't exposed 
and can't be reused by clients.
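
For readers new to the term, here is a minimal plain-Java sketch of a persistent structure with structural sharing: an immutable binary search tree in which an insert copies only the path to the new key. It is not the structure TIM uses internally (that is not exposed, as noted above); `PTree` is a hypothetical name, and the tree is left unbalanced for brevity.

```java
/** Immutable binary search tree; insert copies only the path to the new key. */
final class PTree<K extends Comparable<K>> {
    final K key;
    final PTree<K> left;
    final PTree<K> right;

    private PTree(K key, PTree<K> left, PTree<K> right) {
        this.key = key;
        this.left = left;
        this.right = right;
    }

    /** Returns a new version containing key; the old version stays valid. */
    static <K extends Comparable<K>> PTree<K> insert(PTree<K> t, K key) {
        if (t == null) {
            return new PTree<>(key, null, null);
        }
        int c = key.compareTo(t.key);
        if (c < 0) {
            PTree<K> l = insert(t.left, key);
            return l == t.left ? t : new PTree<>(t.key, l, t.right);
        }
        if (c > 0) {
            PTree<K> r = insert(t.right, key);
            return r == t.right ? t : new PTree<>(t.key, t.left, r);
        }
        return t; // key already present: reuse the existing tree as-is
    }

    static <K extends Comparable<K>> boolean contains(PTree<K> t, K key) {
        while (t != null) {
            int c = key.compareTo(t.key);
            if (c == 0) {
                return true;
            }
            t = c < 0 ? t.left : t.right;
        }
        return false;
    }
}
```

Every version remains readable after later inserts, and two versions that differ by a few keys share all untouched subtrees, which is the property that makes small changes to large datasets cheap.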

This sounds like an interesting and powerful use case, although I'm not sure 
how easily it could be accomplished within the current API. For one thing, we 
don't have a good way of distinguishing mutable and immutable models in Jena's 
type system right now.

Are the "k new Models" both adding and removing triples? If they're just adding 
triples, perhaps a clever wrapper might work.

Otherwise, have you tried using an intermediating caching setup, wherein 
statements that are copied are routed through a cache that prevents 
duplication? I believe Andy deployed a similar technique for some of the TDB 
loading code and saw great improvement therefrom.
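
One way to read "a cache that prevents duplication" is an interning cache: every value is routed through a map that hands back the first equal instance it has seen. The plain-Java sketch below shows only that general technique; `Interner` is a hypothetical name, and this is not a description of what the TDB loading code actually does.

```java
import java.util.HashMap;
import java.util.Map;

/** Returns one canonical instance per distinct value, so equal copies are not retained. */
final class Interner<T> {
    private final Map<T, T> canonical = new HashMap<>();

    /** Returns the first instance seen that equals value (possibly value itself). */
    T intern(T value) {
        T previous = canonical.putIfAbsent(value, value);
        return previous != null ? previous : value;
    }

    int size() {
        return canonical.size();
    }
}
```

This relies on `T` having value-based `equals` and `hashCode`. An unbounded map like this one also keeps every value alive, so a real cache would want a size bound or weak references.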

ajs6f

[1] https://jena.apache.org/documentation/rdf/datasets.html
[2] https://en.wikipedia.org/wiki/Persistent_data_structure

> On Oct 22, 2018, at 12:08 PM, Kevin Dreßler  wrote:
> 
> Hello everyone,
> 
> I have an application using Jena where I frequently have to create copies of 
> Models in order to then process them individually, i.e. all triples of one 
> source Model are added to k new Models which are then mutated.
> 
> For larger Models this obviously takes some time and, more relevant for me, 
> creates a considerable amount of memory pressure.
> However, with a Model implementation based on persistent data structures I 
> could eliminate most of these issues as the amount of data changed is 
> typically under 5% compared to the overall Model size.
> 
> Has anyone ever done something like this before, i.e. are there immutable 
> Model implementations with structural sharing that someone is aware of? If 
> not what would be your advice on how one would approach implementing this in 
> their own code base?
> 
> Best regards,
> Kevin