PrivilegedMGraphWrapper#getGraph() create in-memory copy (was Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files)

Rupert Westenthaler Tue, 20 Mar 2012 00:43:14 -0700

Hi all,

While working on the SingleTdbDatasetTcProvider I noticed that

    PrivilegedMGraphWrapper#getGraph()

is implemented as

        public Graph getGraph() {
                return new SimpleGraph(this);
        }

If I am right, this causes an in-memory copy of the wrapped MGraph to be
created. Is there a special reason for that, or should it be changed?

I would rather expect a PrivilegedGraphWrapper, wrapping the graph returned by
the wrapped MGraph, to be returned. Something like:

        public Graph getGraph() {
                return AccessController.doPrivileged(new PrivilegedAction<Graph>() {

                        @Override
                        public Graph run() {
                                return new PrivilegedGraphWrapper(wrapped.getGraph());
                        }
                });
        }

Maybe one would even like to have only a single PrivilegedGraphWrapper that is
created on the first call to getGraph().
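
A minimal sketch of that single-wrapper idea, under stated assumptions: Graph, MGraph, GraphWrapperSketch and LazyMGraphWrapperSketch below are simplified stand-ins invented for illustration, not the actual Clerezza classes, and double-checked locking is just one possible way to make the first call thread-safe.

```java
import java.security.AccessController;
import java.security.PrivilegedAction;

// Hypothetical, heavily reduced stand-ins for the Clerezza interfaces,
// so that the sketch compiles on its own.
interface Graph {
    int size();
}

interface MGraph {
    Graph getGraph();
}

// Stand-in for PrivilegedGraphWrapper: a view that delegates to the wrapped
// graph instead of copying its triples. The real wrapper would presumably
// also run each delegated call in a privileged block.
class GraphWrapperSketch implements Graph {
    private final Graph wrapped;

    GraphWrapperSketch(Graph wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    public int size() {
        return wrapped.size();
    }
}

// Stand-in for PrivilegedMGraphWrapper with a lazily created graph view.
class LazyMGraphWrapperSketch implements MGraph {
    private final MGraph wrapped;
    // Set on the first call to getGraph(); volatile for double-checked locking.
    private volatile Graph graphView;

    LazyMGraphWrapperSketch(MGraph wrapped) {
        this.wrapped = wrapped;
    }

    @Override
    @SuppressWarnings("removal") // AccessController is deprecated in newer JDKs
    public Graph getGraph() {
        Graph result = graphView;
        if (result == null) {
            synchronized (this) {
                result = graphView;
                if (result == null) {
                    result = AccessController.doPrivileged(new PrivilegedAction<Graph>() {
                        @Override
                        public Graph run() {
                            return new GraphWrapperSketch(wrapped.getGraph());
                        }
                    });
                    graphView = result;
                }
            }
        }
        return result;
    }
}
```

With this, repeated calls to getGraph() return the same wrapper instance, and getGraph() of the wrapped MGraph is invoked only once. Caching the wrapper is only appropriate if the wrapped getGraph() returns a live view rather than a snapshot, which is an assumption here.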

best
Rupert

On 19.03.2012, at 10:45, Daniel Spicar wrote:

> There are a couple of things to keep in mind. I think both are handled
> on a higher layer and should work transparently, but they are worth
> remembering.
> 1. Graph permissions need to work. I think they work via the graph
> URI/name, so they may be handled transparently.
> 2. Make sure rdf.storage.externalizer works with your solution.
> 
> Best,
> Daniel
> 
> On 19 March 2012 09:16, Hasan Hasan <[email protected]> wrote:
> 
>> Hi all,
>> 
>> I generally agree with extending Clerezza to support multiple
>> requirements, so I see the need for SingleDatasetTdbTcProvider,
>> although I am a bit unhappy that application developers have to be
>> aware of it.
>> Note that new Clerezza instances (at least my own build) no longer
>> generate 200 MB of index files for empty graphs, but merely 200 KB.
>> 
>> Regards
>> Hasan
>> 
>> 
>> On Fri, Mar 16, 2012 at 2:10 PM, Rupert Westenthaler <
>> [email protected]> wrote:
>> 
>>> Hi David, stanbol & clerezza community
>>> 
>>> Short summary of the situation:
>>> 
>>> The Ontonet component generates a lot of MGraphs using the Jena TDB
>>> provider. This causes the disk consumption and the number of open files
>>> to explode. See the quoted emails for details.
>>> 
>>> 
>>> @Stanbol: we are already discussing how to avoid the creation of so
>>> many graphs.
>>> 
>>> 
>>> @Clerezza: the observed behavior of the TDB provider is also very
>>> dangerous (at least for typical use cases in Apache Stanbol).
>>> 
>>> Even though it targets something different, CLEREZZA-467 [1] may
>>> provide a possible solution, as it suggests using named graphs instead
>>> of isolated TDB instances for creating MGraphs.
>>> 
>>> To be honest, this would be the optimal solution for our usages of
>>> Clerezza in Stanbol. However, I assume that for a semantic CMS it is
>>> safer to use different TDB datasets.
>>> 
>>> Because of that I would like to make the following proposal, which
>>> hopefully covers the needs of both Apache Stanbol and Apache Clerezza.
>>> 
>>> 1. AbstractTdbTcProvider: provides most of the functionality needed to
>>> store Clerezza MGraphs in Jena TDB.
>>> 
>>> 2. TdbTcProvider: the same as now, but extending the abstract one. It
>>> follows the currently used methodology of mapping Clerezza graphs to
>>> separate TDB datasets.
>>> 
>>> 3. SingleDatasetTdbTcProvider: a TDB provider variant that stores all
>>> MGraphs in a single TDB dataset. This provider should also support
>>> "configurationFactory=true" (multiple instances). Each instance would
>>> use a different TDB dataset to store its MGraphs.
>>> 
>>> By default the SingleDatasetTdbTcProvider would be inactive, because it
>>> requires a configuration of the directory for the TDB dataset as well as
>>> a name (that can be used in Filters). This ensures full backward
>>> compatibility.
>>> 
>>> In environments such as Stanbol, where you want to store multiple
>>> graphs in the same TDB dataset, you would need to provide a
>>> configuration for the SingleDatasetTdbTcProvider. Here you have two
>>> possible usage scenarios:
>>> 
>>> * if you just need a single TDB dataset that stores all MGraphs, then
>>> you can assign a high enough service.ranking to the
>>> SingleDatasetTdbTcProvider and use the TcManager as usual to create
>>> your graphs.
>>> * if you want to use single TDB datasets or a mix of TdbTcProvider and
>>> SingleDatasetTdbTcProvider instances, you will need to use appropriate
>>> filters.
>>> 
>>> 
>>> WDYT
>>> Rupert
>>> 
>>> 
>>> [1] https://issues.apache.org/jira/browse/CLEREZZA-467
>>> 
>>> On 16.03.2012, at 10:44, Rupert Westenthaler wrote:
>>> 
>>>> Hi David, all
>>>> 
>>>> this could be the explanation for the failed build on the Jenkins
>>>> server when the SEO configuration for the Refactor engine was used in
>>>> the default configuration of the Full launcher.
>>>> 
>>>> see http://markmail.org/message/sprwklaobdjankig for details.
>>>> 
>>>> To me it looks as if the RefactorEngine creates multiple Jena TDB
>>>> instances for the various MGraphs it creates. One needs to know that
>>>> even for an empty graph Jena TDB creates ~200 MB of index files, so it
>>>> is important to map multiple MGraphs to different named graphs of the
>>>> same Jena TDB store.
>>>> 
>>>> I have no idea how Clerezza manages this or how Ontonet creates
>>>> MGraphs, but I hope this can help in tracing this down.
>>>> 
>>>> best
>>>> Rupert
>>>> 
>>>> On 16.03.2012, at 10:30, David Riccitelli wrote:
>>>> 
>>>>> Dears,
>>>>> 
>>>>> As I ran into disk issues, I found that this folder:
>>>>> sling/felix/bundleXXX/data/tdb-data/mgraph
>>>>> 
>>>>> where XXX is the bundle ID of:
>>>>> Clerezza - SCB Jena TDB Storage Provider
>>>>> org.apache.clerezza.rdf.jena.tdb.storage
>>>>> 
>>>>> took almost 70 GB of disk space (until the disk space was
>>>>> exhausted).
>>>>> 
>>>>> These are some of the files I found inside:
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology889
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology395
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology363
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology661
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology786
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology608
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology213
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology188
>>>>> 193M ./ontonet%3A%3Ainputstream%3Aontology602
>>>>> 
>>>>> 
>>>>> Any clues?
>>>>> 
>>>>> Thanks,
>>>>> David Riccitelli
>>>>> 
>>>>> 
>>>>> ********************************************************************************
>>>>> InsideOut10 s.r.l.
>>>>> P.IVA: IT-11381771002
>>>>> Fax: +39 0110708239
>>>>> ---
>>>>> LinkedIn: http://it.linkedin.com/in/riccitelli
>>>>> Twitter: ziodave
>>>>> ---
>>>>> Layar Partner Network <http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
>>>>> ********************************************************************************
