+1 to all Rob's comments but especially about inference.

Without inference we routinely handle datasets of 100M+ triples on 8GB machines in production with interactive query performance, though of course the details of the data and queries are critical.

I would not advise live inference in a production system with Jena. Materialize your inferences one way or another (through rules or SPARQL updates) and run your queries over a plain TDB store with no inference, or rewrite your queries to do the inference in the query itself where possible.
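As an illustration of the query-rewrite approach: RDFS subclass reasoning can often be replaced by a property path at query time. A sketch (the myont: prefix and class name are placeholders, not from this thread):

```sparql
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX myont: <http://example.org/myont#>

# Instead of asking a reasoner for ?x rdf:type myont:ParentClass,
# walk the class hierarchy explicitly with a property path.
SELECT ?x WHERE {
  ?x rdf:type/rdfs:subClassOf* myont:ParentClass .
}
```

This runs against a plain TDB store with no reasoner attached, at the cost of a path traversal per query.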

Dave

On 11/10/17 14:47, Rob Vesse wrote:
Comments inline:

On 11/10/2017 11:57, "George News" <george.n...@gmx.net> wrote:

     Hi all,
The project I'm working on currently has a TDB store with approximately
     100M triples, and its size is increasing quite quickly. When I make a
     typical SPARQL query to get data from the system, it takes ages,
     sometimes more than 10-20 minutes. Performance-wise this is not really
     user friendly, so I need to know how I can increase the speed.
I'm running the whole system on a machine with an Intel Xeon E312xx and
     32 GB of RAM, and I often get OutOfMemory exceptions; the Guava
     (com.google.common.cache) cache that Jena uses seems to be the one
     causing the problem.

  Specific stack traces would be useful to understand where the cache is
blowing up. Certain kinds of query may use the cache more heavily than others,
so some elaboration on the general construction of your queries would be
interesting.
Are the figures I'm reporting normal (machine specs, response time,
     etc.)? Is the dataset too big, or the machine too small?

  The size of the data seems small relative to the size of the machine. You
don't specify whether you changed the JVM heap size; most memory usage in TDB
is off-heap via memory-mapped files, so setting too large a heap can
negatively impact performance.

  The response times seem very poor, but that may be down to the nature of
your queries and data structure; since you are unable to show those, we can
only provide generalisations.
For the moment, we have decided to split the graph into pieces, that is,
     generating a new named graph every now and then so that the amount of
     information stored in the "current" graph is smaller. By restricting the
     query to a set of graphs, things work better.
Although this solution works, when we merge the graphs for historical
     queries we face the same problem as before. So, how can we
     increase the speed?
I cannot disclose the dataset or any part of it, but I will try to
     explain it somehow.
- IDs for entities are approximately 255 random ASCII characters. Does
     the size of the IDs affect the speed of SPARQL queries? If yes, can
     I apply a Lucene index to the IDs in order to reduce the query time?

  It depends on the nature of the query. All terms are mapped to 64-bit
internal identifiers, and these are only mapped back to the original terms as
and when the query engine and/or result serialisation requires it. A cache is
used to speed up the mapping in both directions, so depending on the nature of
your queries and your system load you may be thrashing this cache.
- The depth of the graph, or of the information relationships, is around
     7-8 levels at most, but most of the time only 3-4 levels need to be linked.

   Difficult to say how this impacts performance because it really depends on
how you are querying that structure.
- Most of the queries include several:
     ?x myont:hasattribute ?b.
     ?a rdf:type ?b.
Therefore we check the class and subclasses of entities. Is there any way
     to speed up the inference, so that if I ask for the parent class I also
     get the child classes defined in my ontology?

So, are you actively using inference? If you are, that will significantly
degrade performance: when inference is turned on, the inference closure is
computed entirely in memory, i.e. not in TDB, and you will get minimal
performance benefit from using TDB.

  If you only need simple inference like class and property hierarchies, you
may be better served by asserting those statements statically using SPARQL
updates rather than using dynamic inference.
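To make the "assert statically" suggestion concrete, here is a sketch of a one-off SPARQL update that materialises the subclass closure (run it after loading, and again whenever the data or ontology changes):

```sparql
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Assert every superclass type directly, so that later queries can
# run over plain TDB with no reasoner attached.
INSERT { ?x rdf:type ?super }
WHERE  {
  ?x rdf:type ?cls .
  ?cls rdfs:subClassOf+ ?super .
}
```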
- I know the "." in a query acts more or less like a logical AND
     operation. Does the order of the patterns have implications for
     performance? Should I start with the most restrictive ones? Should I
     start with the simplest ones, i.e. checking number values, etc.?

  Yes and no. TDB will attempt to do the necessary scans in an optimal order
based on its knowledge of the statistics of the data. However, this only
applies within a single query pattern, i.e. { }, so depending on the structure
of your query you may need to do some manual reordering. Also, if inference is
involved, that may interact.
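For example, manual reordering usually means putting the most selective pattern first so the later scans have fewer bindings to join against. A sketch using placeholder names (myont:hasAttribute and the literal are illustrative only):

```sparql
PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX myont: <http://example.org/myont#>

SELECT ?x ?type WHERE {
  ?x myont:hasAttribute "rare-value" .  # selective: matches few triples
  ?x rdf:type ?type .                   # broad: matches many triples
}
```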
- Some of the queries use spatial and time filtering. Is it worth
     implementing support for spatial searches with SPARQL? Is there any
     kind of index for time searches?

  There is a geospatial indexing extension but there is no temporal indexing 
provided by Jena.
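For reference, the geospatial extension (jena-spatial) exposes property functions over a Lucene-backed index. A sketch, assuming an index has been built over the usual WGS84 lat/long properties; check the argument form (point, radius, units) against the current documentation:

```sparql
PREFIX spatial: <http://jena.apache.org/spatial#>

# Find subjects indexed within 10 km of the given point.
SELECT ?place WHERE {
  ?place spatial:nearby (51.46 -2.60 10.0 'km') .
}
```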
Any help is more than welcome.

  Without more detail it is difficult to provide more specific help.

Rob
Regards,
     Jorge


