Hi Rupert,

I like your proposal but would suggest:
- SingleDatasetTdbTcProvider should not need a directory configured
- SingleDatasetTdbTcProvider should have the higher weight and thus be the one used by default

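To illustrate the weight idea, here is a minimal, self-contained Java sketch. It does not use Clerezza's actual API: the class WeightedProviderSketch, the Provider type, the helper methods and the weight values 105 and 110 are all hypothetical and only model "highest weight wins by default, unless a provider is picked explicitly".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class WeightedProviderSketch {

    /** Hypothetical stand-in for a weighted storage provider (not a Clerezza type). */
    static final class Provider {
        final String name;
        final int weight;
        final List<String> graphs = new ArrayList<>();

        Provider(String name, int weight) {
            this.name = name;
            this.weight = weight;
        }
    }

    /** Default choice: the registered provider with the highest weight. */
    static Provider pickDefault(List<Provider> providers) {
        return providers.stream()
                .max(Comparator.comparingInt((Provider p) -> p.weight))
                .orElseThrow(() -> new IllegalStateException("no provider registered"));
    }

    /** Explicit choice: select a provider by name, regardless of its weight. */
    static Provider pickByName(List<Provider> providers, String name) {
        return providers.stream()
                .filter(p -> p.name.equals(name))
                .findFirst()
                .orElseThrow(() -> new IllegalArgumentException("unknown provider: " + name));
    }

    /** Records the graph name in the chosen provider and returns the provider's name. */
    static String createGraph(Provider provider, String graphName) {
        provider.graphs.add(graphName);
        return provider.name;
    }

    public static void main(String[] args) {
        List<Provider> providers = new ArrayList<>();
        providers.add(new Provider("TdbTcProvider", 105));              // assumed weight
        providers.add(new Provider("SingleDatasetTdbTcProvider", 110)); // assumed weight

        // Normal case: the graph lands in the highest-weight provider.
        System.out.println(createGraph(pickDefault(providers), "urn:example:shared"));

        // Isolation case: explicitly choose the lower-weight provider.
        System.out.println(
                createGraph(pickByName(providers, "TdbTcProvider"), "urn:example:isolated"));
    }
}
```

With these assumed weights, the default path creates the graph in SingleDatasetTdbTcProvider, while the by-name path corresponds to the isolated-graph case.
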
I think there might be use cases where you want a graph to be isolated from
the rest, but I think the default behaviour should be the more performant and
less memory-expensive SingleDatasetTdbTcProvider. We could add a tool to
Clerezza that allows creating an MGraph in a tc-provider other than the one
with the highest weight.

Cheers,
Reto

On Fri, Mar 16, 2012 at 2:10 PM, Rupert Westenthaler <[email protected]> wrote:

> Hi David, stanbol & clerezza community
>
> Short summary of the situation:
>
> The Ontonet component generates a lot of MGraphs using the Jena TDB
> provider. This causes the disk consumption and number of open files to
> explode. See the quoted emails for details.
>
> @Stanbol: we are already discussing how to avoid the creation of so many
> graphs.
>
> @Clerezza: the observed behavior of the TDB provider is also very dangerous
> (at least for typical use cases in Apache Stanbol).
>
> Even though it targets something different, CLEREZZA-467 may provide a
> possible solution for that, as it suggests using named graphs instead of
> isolated TDB instances for creating MGraphs.
>
> To be honest, this would be the optimal solution for our usages of Clerezza
> in Stanbol. However, I assume that for a semantic CMS it is safer to use
> different TDB datasets.
>
> Because of that I would like to make the following proposal, which
> hopefully covers the needs of both Apache Stanbol and Apache Clerezza.
>
> 1. AbstractTdbTcProvider: provides most of the functionality needed to
>    store Clerezza MGraphs in Jena TDB.
>
> 2. TdbTcProvider: the same as now, but now extending the abstract one. It
>    follows the currently used methodology of mapping Clerezza graphs to
>    separate TDB datasets.
>
> 3. SingleDatasetTdbTcProvider: TDB provider variant that stores all
>    MGraphs in a single TDB dataset. This provider should also support
>    "configurationFactory=true" (multiple instances). Each instance would
>    use a different TDB dataset to store its MGraphs.
>
> By default the SingleDatasetTdbTcProvider would be inactive, because it
> requires a configuration of the directory for the TDB dataset as well as a
> name (that can be used in filters). This ensures full backward
> compatibility.
>
> In environments such as Stanbol, where you want to store multiple graphs
> in the same TDB dataset, you would need to provide a configuration for the
> SingleDatasetTdbTcProvider. Here you have two possible usage scenarios:
>
> * If you just need a single TDB dataset that stores all MGraphs, then you
>   can assign a high enough service.ranking to the SingleDatasetTdbTcProvider
>   and use the TcManager as normal to create your graphs.
> * If you want to use single TDB datasets or a mix of the TdbTcProvider and
>   SingleDatasetTdbTcProviders, you will need to use appropriate filters.
>
> WDYT
> Rupert
>
> [1] https://issues.apache.org/jira/browse/CLEREZZA-467
>
> On 16.03.2012, at 10:44, Rupert Westenthaler wrote:
>
> > Hi David, all
> >
> > this could be the explanation for the failed build on the Jenkins server
> > when the SEO configuration for the Refactor engine was used in the default
> > configuration of the Full launcher.
> >
> > See http://markmail.org/message/sprwklaobdjankig for details.
> >
> > To me it looks as if the RefactorEngine creates multiple Jena TDB
> > instances for the various created MGraphs. One needs to know that even
> > for an empty graph Jena TDB creates ~200 MByte of index files. So it is
> > important to map multiple MGraphs to different named graphs of the same
> > Jena TDB store.
> >
> > I have no idea how Clerezza manages this or how Ontonet creates MGraphs,
> > but I hope this can help in tracing this down.
> >
> > best
> > Rupert
> >
> > On 16.03.2012, at 10:30, David Riccitelli wrote:
> >
> >> Dears,
> >>
> >> As I ran into disk issues, I found that this folder:
> >> sling/felix/bundleXXX/data/tdb-data/mgraph
> >>
> >> where XXX is the bundle of:
> >> Clerezza - SCB Jena TDB Storage Provider
> >> org.apache.clerezza.rdf.jena.tdb.storage
> >>
> >> took almost 70 GBytes of disk space (after which the disk space was
> >> exhausted).
> >>
> >> These are some of the files I found inside:
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology889
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology1041
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology395
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology363
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology661
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology786
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology608
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology213
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology188
> >> 193M ./ontonet%3A%3Ainputstream%3Aontology602
> >>
> >> Any clues?
> >>
> >> Thanks,
> >> David Riccitelli
> >>
> >> ********************************************************************************
> >> InsideOut10 s.r.l.
> >> P.IVA: IT-11381771002
> >> Fax: +39 0110708239
> >> ---
> >> LinkedIn: http://it.linkedin.com/in/riccitelli
> >> Twitter: ziodave
> >> ---
> >> Layar Partner Network <http://www.layar.com/publishing/developers/list/?page=1&country=&city=&keyword=insideout10&lpn=1>
> >> ********************************************************************************
