Hi all,

Thanks a lot for your answers... I have "negotiated" with the admins of the project and I will be giving you examples of the queries and data ;)
We really need to enhance performance. BTW, is Virtuoso good at inference, or will I have the same issues?

Thanks again.

Regards,
Jorge

On 2017-10-11 15:47, Rob Vesse wrote:
> Comments inline:
>
> On 11/10/2017 11:57, "George News" <george.n...@gmx.net> wrote:
>
> > Hi all,
> >
> > The project I'm working on currently has a TDB with approximately 100M
> > triples and the size is increasing quite quickly. When I make a typical
> > SPARQL query for getting data from the system, it takes ages, sometimes
> > more than 10-20 minutes. I think performance-wise this is not really
> > user friendly. Therefore I need to know how I can increase the speed, etc.
> >
> > I'm running the whole system on a machine with Intel Xeon E312xx with
> > 32 GB RAM and many times I'm getting OutOfMemory exceptions, and the
> > google.cache that Jena handles is the one that seems to be causing the
> > problem.
>
> Specific stack traces would be useful to understand where the cache is
> being exploded. Certain kinds of query may use the cache more heavily than
> others so some elaboration on the general construction of queries would be
> interesting.
>
> > Are the figures I'm pointing out normal (machine specs, response time,
> > etc.)? Is it too big/too small?
>
> The size of the data seems small relative to the size of the machine. You
> don’t specify whether you changed the JVM heap size; most memory usage in TDB
> is off-heap via memory mapped files so setting too large a heap can
> negatively impact performance.
>
> The response times seem very poor but that may be the nature of your
> queries and data structure; however, since you are unable to show those we can
> only provide generalisations.
>
> > For the moment, we have decided to split the graph in pieces, that is,
> > generating a new named graph every now and then so the amount of
> > information stored in a "current" graph is smaller. When we then restrict
> > the query to a set of graphs, things work better.
> >
> > Although this solution works, when we merge the graphs for historical
> > queries, we are facing the same problem as before. Then, how can we
> > increase the speed?
> >
> > I cannot disclose the dataset or part of it, but I will try to somehow
> > explain it.
> >
> > - Ids for entities are approximately 255 random ASCII characters. Does
> > the size of the ids affect the speed of the SPARQL queries? If yes, can
> > I apply a Lucene index to the IDs in order to reduce the query time?
>
> It depends on the nature of the query. All terms are mapped into 64-bit
> internal identifiers; these are only mapped back to the original terms as and
> when the query engine and/or results serialisation requires it. A cache is
> used to speed up the mapping in both directions so depending on the nature of
> the queries and your system loads you may be thrashing this cache.
>
> > - The depth level of the graph or the information relationship is around
> > 7-8 levels at most, but most of the time it is required to link 3-4
> > levels.
>
> Difficult to say how this impacts performance because it really depends on
> how you are querying that structure.
>
> > - Most of the queries include several:
> >     ?x myont:hasattribute ?b.
> >     ?a rdf:type ?b.
> >
> > Therefore checking the class and subclasses of entities. Is there any way
> > to speed up the inference, since when I ask for the parent class I also
> > get the child classes defined in my ontology?
>
> So are you actively using inference? If you are then that will significantly
> degrade performance because the inference closure is done entirely in memory,
> i.e. not in TDB, if inference is turned on and you will get minimal
> performance benefit from using TDB.
>
> If you only need simple inference like class and property hierarchy you may
> be better served by asserting those statically using SPARQL updates and not
> using dynamic inference.
>
> > - I know the "." in a query acts more or less like a logical AND
> > operation.
> > Does the order of the statements have implications for
> > performance? Should I start with the most restrictive ones? Should I
> > start with the simplest ones, i.e. checking number values, etc.?
>
> Yes and no. TDB will attempt to do the necessary scans in an optimal order
> based on its knowledge of the statistics of the data. However this only
> applies within a single query pattern, i.e. { }, so depending on the structure
> of your query you may need to do some manual reordering. Also if inference is
> involved then that may interact.
>
> > - Some of the queries use spatial and time filtering. Is it worth
> > implementing the support for spatial searches with SPARQL? Is there any
> > kind of index for time searches?
>
> There is a geospatial indexing extension but there is no temporal indexing
> provided by Jena.
>
> > Any help is more than welcome.
>
> Without more detail it is difficult to provide more detailed help.
>
> Rob
>
> > Regards,
> > Jorge
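
For reference, Rob's suggestion of asserting the class hierarchy statically with a SPARQL update could look roughly like the sketch below. It is only an illustration under assumptions not confirmed in this thread: that the ontology expresses the hierarchy with rdfs:subClassOf, and that the ontology and the instance data sit in the same graph (with named graphs, the patterns would need to be wrapped in GRAPH clauses).

```sparql
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Materialise inferred class membership once, instead of computing it
# with a reasoner at query time.
# Assumption: the class hierarchy is stated with rdfs:subClassOf.
INSERT { ?a rdf:type ?super }
WHERE {
  ?a rdf:type ?class .
  # "+" follows one or more subClassOf links, so every ancestor
  # class of ?class is matched.
  ?class rdfs:subClassOf+ ?super .
}
```

After such an update, a plain pattern like `?a rdf:type ?b` should also match instances of the child classes without dynamic inference. The update would have to be re-run (or applied to the new triples) whenever data or the ontology changes.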