Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Alessandro Adamou Thu, 05 Apr 2012 03:48:16 -0700

Hi Rupert, here are a few more numbers:

on the same setting I loaded the NCI ontology from http://www.mindswap.org/2003/CancerOntology/ (about 400k triples, lightly axiomatized with DL flavor ALE)

on the SingleTdbDatasetTcProvider the storage directory grew by 156 MiB (192 -> 348)

on the TdbTcProvider the newly created dir was 76 MiB above the initial capacity (192 -> 268)

Then I bzipped both directories to see if it was partly "filling" the initial 192 MiB:

- the SingleTdbDatasetTcProvider one shrank to ~25 MiB
- the TdbTcProvider one shrank to ~17 MiB
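
A minimal sketch of how such a measurement could be scripted, assuming Python 3 and only the standard library. The helper names `dir_size` and `bz2_size` are illustrative and not part of Clerezza or Jena, and the sizes are apparent file sizes, which may differ from actual disk usage if the files are sparse.

```python
import io
import os
import tarfile


def dir_size(path):
    """Sum the apparent sizes (in bytes) of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def bz2_size(path):
    """Return the size in bytes of a bzip2-compressed tar archive of path.

    The archive is built in memory, so this is only suitable for
    directories that comfortably fit in RAM.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tar:
        tar.add(path, arcname=".")
    return len(buf.getvalue())
```

Calling both helpers on a storage directory before and after loading an ontology gives the kind of raw and compressed figures quoted above.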

I guess this overhead is due to having to store a lot more quadruples because of the named graphs. I noticed that the files (GOSP|GPOS|GSPO|OSPG|POSG|SPOG).dat, which I assume store quadruples, are each 4 times as large in the SingleTdbDatasetTcProvider database, whereas the triple files (OSP|POS|SPO).dat were the same size. I guess this redundancy is the price paid for fast access.
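
The comparison of index files can be sketched as follows. This assumes, as in the message, that the six four-letter `.dat` files hold the quad indexes and the three three-letter ones hold the triple indexes; `index_sizes` is a hypothetical helper name, not a Jena API.

```python
import os

# File stems taken from the message above; their interpretation as
# quad vs. triple indexes is the author's assumption.
QUAD_INDEXES = {"GOSP", "GPOS", "GSPO", "OSPG", "POSG", "SPOG"}
TRIPLE_INDEXES = {"OSP", "POS", "SPO"}


def index_sizes(tdb_dir):
    """Return total bytes used by quad and triple index .dat files."""
    sizes = {"quad": 0, "triple": 0}
    for name in os.listdir(tdb_dir):
        stem, ext = os.path.splitext(name)
        if ext != ".dat":
            continue  # skip everything that is not a .dat file
        size = os.path.getsize(os.path.join(tdb_dir, name))
        if stem in QUAD_INDEXES:
            sizes["quad"] += size
        elif stem in TRIPLE_INDEXES:
            sizes["triple"] += size
    return sizes
```

Running it against both storage directories would show whether the extra space really sits in the quad index files.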

Perhaps mine is a fuzzy interpretation though? Still, it looks pretty good to me.


Best,

Alessandro


----------

On 4/4/12 7:31 PM, Rupert Westenthaler wrote:

On 04.04.2012, at 19:18, Alessandro Adamou wrote:

Hi Rupert, all,

just telling you that I have tried the SingleTdbDatasetTcProvider in the field
with one of my use cases which involves many small ontologies (content design
patterns).

I've created ~20 graphs totalling about 500 triples

On OS X 10.6.8 (on HFS+ filesystem with journalling) the database grew from an 
initial 184MiB to 248MiB

I have yet to test large graphs, so I cannot tell if the overhead is caused by
the named graph indexes or the triple storage, but this is already a big leap from
the TdbTcProvider.

Thx for testing.

Did you already commit this component to rdf.jena.tdb.storage ?

No not yet, but I have made some improvements and fixed some bugs since the 
last patch attached to the Issue. I hope I will have some time to finish this 
later this week.

best
Rupert

Best,

Alessandro

On 3/19/12 9:16 AM, Hasan Hasan wrote:

Hi all,

I generally agree with extending Clerezza so that it can support multiple
requirements. Thus, I see the need for SingleDatasetTdbTcProvider,
although I am a bit unhappy that application developers
have to be aware of this.
Note that new Clerezza instances (at least my own build) no longer
generate 200 MB of index files for empty graphs, but merely 200K.

Regards
Hasan


On Fri, Mar 16, 2012 at 2:10 PM, Rupert Westenthaler<
[email protected]>   wrote:

Hi David, stanbol & clerezza community

Short summary of the situation:

The Ontonet component generates a lot of MGraphs using the Jena TDB
provider. This causes the disk consumption and number of open files to
explode. See the quoted emails for details.


@Stanbol we are already discussing how to avoid the creation of so many
graphs


@Clerezza the observed behavior of the TDB provider is also very dangerous
(at least for typical use cases in Apache Stanbol).

Even though it targets a different problem, CLEREZZA-467 [1] may provide a
possible solution for that, as it suggests using named graphs instead of
isolated TDB instances for creating MGraphs.

To be honest this would be the optimal solution for our usages of Clerezza
in Stanbol. However, I assume that for a semantic CMS it is safer to use
different TDB datasets.

Because of that I would like to make the following proposal that
hopefully covers both the needs of Apache Stanbol and Apache Clerezza.

1. AbstractTdbTcProvider: providing most of the functionality needed to
store Clerezza MGraphs in Jena TDB

2. TdbTcProvider: The same as now but now extending the abstract one. It
follows the currently used methodology to map Clerezza graphs to separate
TDB datasets

3. SingleDatasetTdbTcProvider: TDB provider variant that stores all
MGraphs in a single TDB dataset. This provider should also support
"configurationFactory=true" (multiple instances). Each instance would use a
different TDB dataset to store its MGraphs.

By default the SingleDatasetTdbTcProvider would be inactive, because it
requires a configuration of the directory for the  TDB dataset as well as a
name (that can be used in Filters). This ensures full backward
compatibility.

In environments - such as Stanbol - where you want to store multiple graphs
in the same TDB dataset you would need to provide a configuration for the
SingleDatasetTdbTcProvider. Here you have two possible usage scenarios:

* if you just need a single TDB dataset that stores all MGraphs, then you
can assign a high enough service.ranking to the SingleDatasetTdbTcProvider
and use the TcManager to create your graphs as usual.
* if you want to use several single TDB datasets or a mix of the TdbTcProvider
and SingleDatasetTdbTcProvider instances you will need to use appropriate filters.
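
The two scenarios can be illustrated with a toy model. This is only a sketch of the selection idea, not Clerezza's TcManager or the OSGi API: the `Provider` class, the `select_provider` function and the provider names are hypothetical, and the `ranking` field merely stands in for the service.ranking property.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Provider:
    name: str      # hypothetical configured name, usable in a filter
    ranking: int   # stands in for the service.ranking property


def select_provider(providers: Sequence[Provider],
                    name: Optional[str] = None) -> Provider:
    """Pick a provider: filter by name if given, then take the highest ranking."""
    candidates = [p for p in providers if name is None or p.name == name]
    if not candidates:
        raise LookupError(f"no provider matches name={name!r}")
    return max(candidates, key=lambda p: p.ranking)


providers = [
    Provider(name="tdb-per-graph", ranking=0),      # hypothetical default
    Provider(name="stanbol-single", ranking=100),   # hypothetical single dataset
]

# Scenario 1: no filter, the highest-ranked provider is used.
default = select_provider(providers)

# Scenario 2: an explicit filter selects a specific provider.
explicit = select_provider(providers, name="tdb-per-graph")
```

In the first scenario client code does not need to change; in the second, the caller has to know the configured name.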


WDYT
Rupert


[1] https://issues.apache.org/jira/browse/CLEREZZA-467

On 16.03.2012, at 10:44, Rupert Westenthaler wrote:

Hi David, all

this could be the explanation for the failed build on the Jenkins server
when the SEO configuration for the Refactor engine was used in the default
configuration of the Full launcher

see http://markmail.org/message/sprwklaobdjankig for details.

To me it looks as if the RefactorEngine creates multiple Jena TDB
instances for the various MGraphs it creates. One needs to know that even
for an empty graph Jena TDB creates ~200MByte of index files. So it is
important to map multiple MGraphs to different named graphs of the same
Jena TDB store.
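
A rough back-of-the-envelope sketch of why this matters, using the figures reported in the quoted message below (193M per graph directory, almost 70 gbytes in total). It assumes that "193M" means MiB and "70 gbytes" means GiB, and that every graph costs the same fixed overhead; the function names are illustrative only.

```python
MIB = 2 ** 20
GIB = 2 ** 30

# Assumption: each per-graph TDB directory costs a fixed 193 MiB.
OVERHEAD_PER_GRAPH = 193 * MIB


def per_graph_cost(n_graphs, overhead_bytes=OVERHEAD_PER_GRAPH):
    """Disk space in bytes if every graph gets its own TDB directory."""
    return n_graphs * overhead_bytes


def graphs_that_fill(disk_bytes, overhead_bytes=OVERHEAD_PER_GRAPH):
    """Number of whole per-graph directories that fit into disk_bytes."""
    return disk_bytes // overhead_bytes


if __name__ == "__main__":
    print(graphs_that_fill(70 * GIB))       # graphs needed to use ~70 GiB
    print(per_graph_cost(100) / GIB)        # GiB used by 100 empty graphs
```

Under these assumptions a few hundred mostly empty graphs are enough to account for the reported disk usage.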

I have no idea how Clerezza manages this or how Ontonet creates MGraphs,
but I hope this can help in tracing this down.

best
Rupert

On 16.03.2012, at 10:30, David Riccitelli wrote:

Dears,

As I ran into disk issues, I found that this folder:
sling/felix/bundleXXX/data/tdb-data/mgraph

where XXX is the bundle of:
Clerezza - SCB Jena TDB Storage Provider
org.apache.clerezza.rdf.jena.tdb.storage

took almost 70 GB of disk space (at which point the disk space was
exhausted).

These are some of the files I found inside:
193M ./ontonet%3A%3Ainputstream%3Aontology889
193M ./ontonet%3A%3Ainputstream%3Aontology1041
193M ./ontonet%3A%3Ainputstream%3Aontology395
193M ./ontonet%3A%3Ainputstream%3Aontology363
193M ./ontonet%3A%3Ainputstream%3Aontology661
193M ./ontonet%3A%3Ainputstream%3Aontology786
193M ./ontonet%3A%3Ainputstream%3Aontology608
193M ./ontonet%3A%3Ainputstream%3Aontology213
193M ./ontonet%3A%3Ainputstream%3Aontology188
193M ./ontonet%3A%3Ainputstream%3Aontology602


Any clues?

Thanks,
David Riccitelli

********************************************************************************

InsideOut10 s.r.l.
P.IVA: IT-11381771002
Fax: +39 0110708239
---
LinkedIn: http://it.linkedin.com/in/riccitelli
Twitter: ziodave
---
Layar Partner Network <http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
********************************************************************************


--
M.Sc. Alessandro Adamou

Alma Mater Studiorum - Università di Bologna
Department of Computer Science
Mura Anteo Zamboni 7, 40127 Bologna - Italy

Semantic Technology Laboratory (STLab)
Institute for Cognitive Science and Technology (ISTC)
National Research Council (CNR)
Via Nomentana 56, 00161 Rome - Italy


"I will give you everything, so long as you do not demand anything."
(Ettore Petrolini, 1930)

Not sent from my iSnobTechDevice


