Hi all,

Thanks a lot for your answers... I have "negotiated" with the admins of
the project and I will be giving you examples of the queries and data ;)

We really need to enhance performance. BTW, is Virtuoso good at inference,
or will I have the same issues?

Thanks again.
Regards,
Jorge

On 2017-10-11 15:47, Rob Vesse wrote:
> Comments inline:
> 
> On 11/10/2017 11:57, "George News" <george.n...@gmx.net> wrote:
> 
>     Hi all,
>     
>     The project I'm working on currently has a TDB dataset with
>     approximately 100M triples, and the size is increasing quite quickly.
>     When I make a typical SPARQL query to get data out of the system, it
>     takes ages, sometimes more than 10-20 minutes. Performance-wise this is
>     not really user friendly, so I need to know how I can increase the
>     speed, etc.
>     
>     I'm running the whole system on a machine with an Intel Xeon E312xx and
>     32 GB RAM, and I often get OutOfMemory exceptions; the Google (Guava)
>     cache that Jena uses seems to be the cause of the problem.
> 
> Specific stack traces would be useful to understand where the cache is
> blowing up. Certain kinds of query may use the cache more heavily than
> others, so some elaboration on the general construction of your queries
> would be interesting.
>     
>     Are the figures I'm quoting normal (machine specs, response time,
>     etc.)? Is the dataset too big, or the machine too small?
> 
> The size of the data seems small relative to the size of the machine. You
> don't say whether you changed the JVM heap size; most memory usage in TDB
> is off-heap via memory-mapped files, so setting too large a heap can
> negatively impact performance.
> 
> The response times seem very poor, but that may be down to the nature of
> your queries and data structure; since you are unable to show those, we can
> only provide generalisations.
>     
>     For the moment, we have decided to split the graph into pieces, that
>     is, generating a new named graph every now and then so that the amount
>     of information stored in the "current" graph is smaller. By restricting
>     the query to a set of graphs, things work better; roughly as in the
>     sketch below.
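>     
>     A minimal sketch of what I mean (the graph IRIs are placeholders),
>     restricting a query to two recent named graphs:
>     
>     PREFIX myont: <http://example.org/myont#>
>     SELECT ?s ?v WHERE {
>       # only search inside the graphs listed in VALUES
>       GRAPH ?g { ?s myont:hasattribute ?v . }
>       VALUES ?g { <urn:graph:2017-09> <urn:graph:2017-10> }
>     }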
>     
>     Although this solution works, when we merge the graphs for historical
>     queries we face the same problem as before. So, how can we increase
>     the speed?
>     
>     I cannot disclose the dataset, or even part of it, but I will try to
>     describe it.
>     
>     - IDs for entities are approximately 255 random ASCII characters. Does
>     the size of the IDs affect the speed of the SPARQL queries? If so, can
>     I apply a Lucene index to the IDs in order to reduce the query time?
> 
> It depends on the nature of the query. All terms are mapped into 64-bit
> internal identifiers, and these are only mapped back to the original terms
> as and when the query engine and/or results serialisation requires it. A
> cache is used to speed up the mapping in both directions, so depending on
> the nature of the queries and your system load you may be thrashing this
> cache.
>     
>     - The depth of the graph, i.e. of the information relationships, is
>     around 7-8 levels at most, but most of the time it is only necessary
>     to link 3-4 levels.
> 
> It is difficult to say how this impacts performance because it really
> depends on how you are querying that structure.
>     
>     - Most of the queries include several patterns such as:
>     ?x myont:hasattribute ?b.
>     ?a rdf:type ?b.
>     
>     That is, checking the class and subclasses of entities. Is there any
>     way to speed up the inference, so that if I ask for the parent class I
>     also get the child classes defined in my ontology?
> 
> So are you actively using inference? If you are, then that will
> significantly degrade performance, because the inference closure is
> computed entirely in memory, i.e. not in TDB; with inference turned on you
> will get minimal performance benefit from using TDB.
> 
> If you only need simple inference like class and property hierarchies, you
> may be better served by asserting those statically using SPARQL updates
> rather than using dynamic inference, as in the sketch below.
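> 
> A rough sketch of materialising the class hierarchy with a SPARQL update
> (assuming the ontology uses rdfs:subClassOf and is stored alongside the
> data; adjust graphs and prefixes to your setup):
> 
>     PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>     PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
> 
>     # Assert rdf:type for every superclass so queries can match the
>     # parent class without dynamic inference
>     INSERT { ?x rdf:type ?super }
>     WHERE  { ?x rdf:type ?sub .
>              ?sub rdfs:subClassOf+ ?super }
> 
> You would need to re-run this (or a more targeted update) whenever the
> data or the ontology changes.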
>     
>     - I know the "." in a query acts more or less like a logical AND.
>     Does the order of the triple patterns have implications for
>     performance? Should I start with the most restrictive ones? Or with
>     the simplest ones, e.g. checking number values, etc.?
> 
> Yes and no. TDB will attempt to do the necessary scans in an optimal order
> based on its knowledge of the statistics of the data. However, this only
> applies within a single group graph pattern, i.e. { }, so depending on the
> structure of your query you may need to do some manual reordering (a
> sketch follows). Also, if inference is involved then that may interact.
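> 
> Purely as an illustration (the myont:hasId predicate and its literal are
> made up), putting the most selective pattern first can help when the
> optimiser cannot do it for you:
> 
>     PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>     PREFIX myont: <http://example.org/myont#>
> 
>     SELECT ?b ?type WHERE {
>       ?x myont:hasId "ABC123" .      # most selective: binds ?x to few values
>       ?x myont:hasattribute ?b .     # then expand from the bound ?x
>       ?b rdf:type ?type .            # least selective last
>     }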
>     
>     - Some of the queries use spatial and time filtering. Is it worth
>     implementing support for spatial searches with SPARQL? Is there any
>     kind of index for time searches?
> 
>  There is a geospatial indexing extension but there is no temporal indexing 
> provided by Jena.
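> 
> For time, a plain FILTER over xsd:dateTime values works without any
> special index, although TDB still has to scan the candidate bindings. A
> sketch, assuming a hypothetical myont:hasTimestamp property:
> 
>     PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#>
>     PREFIX myont: <http://example.org/myont#>
> 
>     SELECT ?obs WHERE {
>       ?obs myont:hasTimestamp ?t .
>       # keep the range as tight as possible to limit the scan
>       FILTER (?t >= "2017-01-01T00:00:00Z"^^xsd:dateTime &&
>               ?t <  "2017-02-01T00:00:00Z"^^xsd:dateTime)
>     }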
>     
>     Any help is more than welcome.
> 
> Without more detail it is difficult to provide more specific help.
> 
> Rob
>     
>     Regards,
>     Jorge
>     
