Re: clerezza.rdf.jena.tdb.storage filling up with ontonet files

Reto Bachmann-Gmür Thu, 05 Apr 2012 04:00:06 -0700

Hi Rupert,

I like your proposal but would suggest:
- SingleDatasetTdbTcProvider should not need a directory configured
- SingleDatasetTdbTcProvider should have the higher weight and thus be the
one used by default


I think there might be use cases where you want a graph to be isolated from
the rest, but I think the default behaviour should be the more performant
and less memory-expensive SingleDatasetTdbTcProvider.
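
To put rough numbers on the cost difference, here is a back-of-the-envelope sketch in Java. It assumes the ~193 MB per-dataset baseline seen in David's listing quoted below and, as a simplification, that a shared dataset pays that baseline only once (the actual triple data is ignored). The class and method names are made up for illustration and are not part of Clerezza or Jena.

```java
public class TdbFootprintSketch {

    // Assumption: ~193 MB of index files per TDB dataset, taken from the
    // directory listing quoted later in this thread.
    static final long BASELINE_MB_PER_DATASET = 193;

    // One TDB dataset per MGraph: the baseline is paid once per graph.
    static long isolatedFootprintMb(int graphCount) {
        return graphCount * BASELINE_MB_PER_DATASET;
    }

    // All MGraphs stored as named graphs in one dataset: the baseline is
    // paid once, no matter how many graphs exist (simplification).
    static long sharedFootprintMb(int graphCount) {
        return graphCount == 0 ? 0 : BASELINE_MB_PER_DATASET;
    }

    public static void main(String[] args) {
        int graphs = 370; // hypothetical number of ontonet graphs
        System.out.println("isolated: " + isolatedFootprintMb(graphs) + " MB");
        System.out.println("shared:   " + sharedFootprintMb(graphs) + " MB");
    }
}
```

Under these assumptions a few hundred empty graphs are already enough to reach the roughly 70 GB that David reported, while the shared variant stays at a single baseline.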

We could add a tool to Clerezza that allows creating an MGraph in a
tc-provider other than the one with the highest weight.
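
Roughly what I have in mind, as a minimal sketch rather than a patch: the highest-weight provider is used by default, and a separate entry point lets a tool pick another provider by name. The Provider class, the provider names and the weights below are hypothetical stand-ins, not the real Clerezza interfaces or values.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ProviderSelectionSketch {

    /** Hypothetical stand-in for a storage provider; not the Clerezza interface. */
    static final class Provider {
        final String name;
        final int weight;
        final Set<String> graphNames = new HashSet<>();

        Provider(String name, int weight) {
            this.name = name;
            this.weight = weight;
        }

        void createGraph(String graphName) {
            if (!graphNames.add(graphName)) {
                throw new IllegalStateException("Graph already exists: " + graphName);
            }
        }
    }

    private final List<Provider> providers = new ArrayList<>();

    void register(Provider provider) {
        providers.add(provider);
    }

    /** Default path: the provider with the highest weight gets the new graph. */
    Provider createGraph(String graphName) {
        Provider best = providers.stream()
                .max(Comparator.comparingInt(p -> p.weight))
                .orElseThrow(() -> new IllegalStateException("No provider registered"));
        best.createGraph(graphName);
        return best;
    }

    /** Tool path: create the graph in an explicitly named provider, ignoring weights. */
    Provider createGraphIn(String providerName, String graphName) {
        Provider chosen = providers.stream()
                .filter(p -> p.name.equals(providerName))
                .findFirst()
                .orElseThrow(() -> new IllegalArgumentException("Unknown provider: " + providerName));
        chosen.createGraph(graphName);
        return chosen;
    }

    public static void main(String[] args) {
        ProviderSelectionSketch manager = new ProviderSelectionSketch();
        manager.register(new Provider("tdb-isolated", 105)); // hypothetical weights
        manager.register(new Provider("tdb-single", 110));
        System.out.println(manager.createGraph("urn:x-example:shared").name);
        System.out.println(manager.createGraphIn("tdb-isolated", "urn:x-example:isolated").name);
    }
}
```

The point of the second method is only that isolation becomes an explicit choice, while everything created the usual way ends up in the single dataset.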

Cheers,
Reto

On Fri, Mar 16, 2012 at 2:10 PM, Rupert Westenthaler <
[email protected]> wrote:

> Hi David, stanbol & clerezza community
>
> Short summary of the situation:
>
> The Ontonet component generates a lot of MGraphs using the Jena TDB
> provider. This causes the disk consumption and number of open files to
> explode. See the quoted emails for details.
>
>
> @Stanbol we are already discussing how to avoid the creation of so many
> graphs.
>
>
> @Clerezza the observed behavior of the TDB provider is also very dangerous
> (at least for typical use cases in Apache Stanbol).
>
> Even though it targets a different issue, CLEREZZA-467 [1] may provide a
> possible solution for that, as it suggests using named graphs instead of
> isolated TDB instances for creating MGraphs.
>
> To be honest this would be the optimal solution for our usages of Clerezza
> in Stanbol. However, I assume that for a semantic CMS it is safer to use
> different TDB datasets.
>
> Because of that I would like to make the following proposal that
> hopefully covers both the needs of Apache Stanbol and Apache Clerezza.
>
> 1. AbstractTdbTcProvider: providing most of the functionality needed to
> store Clerezza MGraphs in Jena TDB
>
> 2. TdbTcProvider: The same as now, but extending the abstract one. It
> follows the currently used methodology to map Clerezza graphs to separate
> TDB datasets
>
> 3. SingleDatasetTdbTcProvider: TDB provider variant that stores all
> MGraphs in a single TDB dataset. This provider should also support
> "configurationFactory=true" (multiple instances). Each instance would use a
> different TDB dataset to store its MGraphs.
>
> By default the SingleDatasetTdbTcProvider would be inactive, because it
> requires a configuration of the directory for the TDB dataset as well as a
> name (that can be used in Filters). This ensures full backward
> compatibility.
>
> In environments such as Stanbol, where you want to store multiple graphs
> in the same TDB dataset, you would need to provide a configuration for the
> SingleDatasetTdbTcProvider. Here you have two possible usage scenarios:
>
> * if you just need a single TDB dataset that stores all MGraphs, then you
> can assign a high enough service.ranking to the SingleDatasetTdbTcProvider
> and use the TcManager as usual to create your graphs.
> * if you want to use single TDB datasets or a mix of TdbTcProvider and
> SingleDatasetTdbTcProviders, you will need to use the appropriate filters.
>
>
> WDYT
> Rupert
>
>
> [1] https://issues.apache.org/jira/browse/CLEREZZA-467
>
> On 16.03.2012, at 10:44, Rupert Westenthaler wrote:
>
> > Hi David, all
> >
> > this could be the explanation for the failed build on the Jenkins server
> when the SEO configuration for the Refactor engine was used in the default
> configuration of the Full launcher
> >
> > see http://markmail.org/message/sprwklaobdjankig for details.
> >
> > To me it looks as if the RefactorEngine creates multiple
> Jena TDB instances for the various MGraphs it creates. One needs to know that
> even for an empty graph Jena TDB creates ~200 MB of index files. So it is
> important to map multiple MGraphs to different named graphs of the same
> Jena TDB store.
> >
> > I have no idea how Clerezza manages this or how Ontonet creates MGraphs,
> but I hope this can help in tracing this down.
> >
> > best
> > Rupert
> >
> > On 16.03.2012, at 10:30, David Riccitelli wrote:
> >
> >> Dears,
> >>
> >> As I ran into disk issues, I found that this folder:
> >> sling/felix/bundleXXX/data/tdb-data/mgraph
> >>
> >> where XXX is the bundle of:
> >> Clerezza - SCB Jena TDB Storage Provider
> >> org.apache.clerezza.rdf.jena.tdb.storage
> >>
> >> took almost 70 GB of disk space (at which point the disk space was
> >> exhausted).
> >>
> >> These are some of the files I found inside:
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology889
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology395
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology363
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology661
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology786
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology608
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology213
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology188
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology602
> >>
> >>
> >> Any clues?
> >>
> >> Thanks,
> >> David Riccitelli
> >>
> >>
> ********************************************************************************
> >> InsideOut10 s.r.l.
> >> P.IVA: IT-11381771002
> >> Fax: +39 0110708239
> >> ---
> >> LinkedIn: http://it.linkedin.com/in/riccitelli
> >> Twitter: ziodave
> >> ---
> >> Layar Partner Network<
> http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1
> >
> >>
> ********************************************************************************
> >
>
>
