On a related note, but more specific to TopBraid: I noticed that when 
working on a large model (about 2,000,000 triples), TopBraid seems to 
behave a little differently.  For example, when the model is opened, it 
doesn't seem to perform the "default" inferencing it usually does 
automatically for smaller models: class/subclass inferencing isn't 
performed, nor are instance counts shown.  Is that correct?  I'm guessing 
the reason is that doing otherwise would simply be too memory intensive.
As I work with larger and larger models, I'm coming to the conclusion 
that if you have a model of any size (say on the order of 1,000,000 
triples) and you want to do any inferencing (even OWL_MEM), you will need 
a 64-bit machine.  Would you say this is a fair statement?

Thanks,
Jeff

> -----Original Message-----
> From: [email protected] [mailto:topbraid-
> [email protected]] On Behalf Of Schmitz, Jeffrey A
> Sent: Tuesday, June 01, 2010 7:48 AM
> To: TopBraid Suite Users
> Subject: [topbraid-users] Re: Accessing large graphs through TBC
>
> Thanks Holger,
>    This has become a REALLY cogent point for us, and I think your
> great writeup here belongs front and center in TopBraid's
> documentation, and maybe even more so in Jena's.  For me anyway, it's
> very easy to get lost in all the permutations of model views (e.g. base
> models, the myriad kinds of inference models, and how each is handled
> by the myriad kinds of underlying databases) that we work with, both
> within TopBraid and when using the SPIN and Jena APIs, and to quickly
> lose track of some REALLY important issues about what's really going on
> under the hood, especially as both TopBraid and the Jena API (much to
> their credit) do their best to abstract these details away from their
> users.  But understanding, for example, when a seemingly simple change
> to code or a simple selection in the TBC UI is going to cause a
> possibly huge union graph to be read completely into memory can be
> REALLY important.
>
> Anyway, I've got lots of more detailed questions and curiosities about
> this subject, probably more than can be answered here or even on the
> Jena board.  It would be great if some kind of discussion group were
> set up at SemTech (hint hint) hitting on some of these issues and the
> different approaches that have been used to combine the power of
> Jena/SPARQL/SPIN/TopSPIN and other inference engines in the most
> scalable fashion possible.  Again, this is really more a Jena issue
> than a TopBraid one, but my guess is that some of the most
> knowledgeable and experienced Jena users reside at TopBraid.
>
> On May 28, 6:01 pm, Holger Knublauch <[email protected]> wrote:
> > Here's another hint: be careful which Graph you are targeting. If you
> just press Run Query in TBC's SPARQL view, then the query will execute
> over the current graph. This is however a union graph of the base graph
> + imports + inferences + system triples. Whenever multiple graphs are
> involved, the query engine will not be able to exploit any native
> optimizations of the database (Jena: QueryHandler). As a result,
> complex basic graph patterns will need to be split into many small SPO
> requests, and this can have a significant performance penalty.
> >
> > To verify this, check the button in the SPARQL view to run the query
> on the Base Model only. If your main file is a TDB then Jena can send
> the whole query to the database where it will be handled natively (and
> likely much faster than having to merge multiple sub-graphs).
> >
> > To enforce this in SPIN and elsewhere, you can fine tune which graphs
> you are targeting, using the GRAPH keyword in SPARQL. Use a construct
> such as
> >
> > ...
> > WHERE {
> >     GRAPH <http://example.org/myMainGraph> {
> >         # All the heavy lifting goes here
> >     }
> >
> > }
> >
> > to precisely select what graph to operate on. From 3.4 onwards we
> have introduced a special named graph called <urn:tb:sessionbase> that
> will be the base graph of the current file only.
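> >
> > As a complete sketch (the graph name, prefix, and properties here are
> > made up for illustration), such a query might look like:
> >
> > PREFIX ex: <http://example.org/ns#>
> > SELECT ?person ?name
> > WHERE {
> >     GRAPH <http://example.org/myMainGraph> {
> >         ?person a ex:Person ;
> >                 ex:name ?name .
> >     }
> > }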
> >
> > If this issue impacts you (or others), then I could think about how
> to generalize this in SPIN, to make sure that constraints and rules are
> only executed on the base model.
> >
> > Viele Gruesse
> > Holger
> >
> > On May 29, 2010, at 5:42 AM, Scott Henninger wrote:
> >
> > > Bad typo that needs fixing.  Let me try that again:
> >
> > > Oh, another one is that post-processing of SELECT results can be
> > > expensive *if* the result set is large.  I.e. ORDER BY, GROUP BY,
> > > aggregates (count, max, min), etc.
> >
> > > -- Scott
> >
> > > On May 28, 2:36 pm, Scott Henninger <[email protected]>
> > > wrote:
> > >> Oh, another one is that post-processing of SELECT results can be
> > >> expensive is the result set is large.  I.e. ORDER BY, GROUP BY,
> > >> aggregates (count, max, min), etc.
> >
> > >> -- Scott
> >
> > >> On May 28, 2:12 pm, Christian Fuerber <[email protected]> wrote:
> >
> > >>> Scott, thank you so much for your advice! I will try to optimize
> my
> > >>> queries, but I'm also eager for more performance tuning hints.
> >
> > >>> Cheers,
> >
> > >>> Christian
> >
> > >>> On 28 Mai, 20:51, Scott Henninger <[email protected]>
> wrote:
> >
> > >>>> Christian; In the end it's really no different than any kind of
> > >>>> database application.  There will always be a point at which
> > >>>> performance suffers and/or memory becomes scarce.  There are also
> > >>>> tradeoffs for large heap spaces - garbage collection overhead
> > >>>> increases with the amount of memory you are trying to collect
> > >>>> garbage on.
> >
> > >>>> So the first set of advice I'd give is to look carefully at your
> > >>>> queries.  Keep in mind that ?s ?p ?o says "bring everything into
> > >>>> memory", thus defeating the purpose of using a back-end.
>  Fortunately
> > >>>> Composer's SPARQL engine, ARQ, tends to ignore these.  But the
> advice
> > >>>> is still sage - cut down the size of the graph match as soon as
> > >>>> possible.  Note that OPTIONAL is fairly dangerous to start with -
> > >>>> it states "find this graph pattern with or without that graph
> > >>>> pattern" and thus increases the search space.  We are working to
> > >>>> optimize this better.
> >
> > >>>> Turn on Profiling Mode in the SPARQL View's query debugger and
> find
> > >>>> the triple pattern placements that result in the fewest match
> attempts
> > >>>> - i.e. use it to make sure you prune the search space optimally.
>  It
> > >>>> will also give you the true ordering of evaluation, which is
> normally
> > >>>> top-to-bottom, but ARQ will perform some optimizations.  Try
> turning
> > >>>> on filter placement and filter early, if that's possible.  Use
> LIMIT
> > >>>> and OFFSET to experiment with pieces of the results.
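> >
> > >>>> For instance (an illustrative query - the label property is just
> > >>>> an example), LIMIT and OFFSET let you test a pattern on a small
> > >>>> slice of the results first:
> >
> > >>>> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
> > >>>> SELECT ?s ?label
> > >>>> WHERE { ?s rdfs:label ?label }
> > >>>> LIMIT 100
> > >>>> OFFSET 200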
> >
> > >>>> Avoid string matches as much as possible - FILTERs are often
> deadly in
> > >>>> this respect, so make sure you cut down your search space as
> much as
> > >>>> possible before doing this kind of string compare.  Note this
> nearly
> > >>>> conflicts with the earlier statement.  Where to place filters
> depends
> > >>>> on how computationally expensive the filter expression is and
> how much
> > >>>> it prunes the search space for downstream processing.
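> >
> > >>>> As a sketch (class and property names here are hypothetical),
> > >>>> that ordering looks like:
> >
> > >>>> PREFIX ex: <http://example.org/ns#>
> > >>>> SELECT ?s
> > >>>> WHERE {
> > >>>>     ?s a ex:Person .                 # cheap patterns prune first
> > >>>>     ?s ex:name ?name .
> > >>>>     FILTER (regex(?name, "Smith"))   # string compare last
> > >>>> }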
> >
> > >>>> There are probably a few other SPARQL query tips and I'd love to
> hear
> > >>>> from others on this.  (Self-plug alert: we cover SPARQL query
> > >>>> performance in our Advanced Product training.)
> >
> > >>>> Divide-and-conquer is the next step.  View your data back-end as
> a
> > >>>> huge well and you only want a specific bucketful at a time,
> being
> > >>>> careful to leave the rest where it is.  I.e. view heap space
> memory as
> > >>>> a limited resource.  SPARQLMotion is an excellent tool for this
> task
> > >>>> as it can be used to get specific sets of triples that can be
> combined
> > >>>> for further processing.  Remember that ASK queries terminate
> when a
> > >>>> match is found, so if you just need to find if the graph pattern
> > >>>> exists, ASK is much more efficient than SELECT.
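> >
> > >>>> For example (hypothetical class name), this returns true as soon
> > >>>> as one match is found, without enumerating results:
> >
> > >>>> PREFIX ex: <http://example.org/ns#>
> > >>>> ASK { ?s a ex:Person }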
> >
> > >>>> For SPIN the advice is pretty much the same (Holger can say
> more).  Be
> > >>>> sure to use ?this because the engine is optimized for that pre-
> bound
> > >>>> variable.  SPIN will iterate until no new triples are found
> unless you
> > >>>> tell the engine otherwise.  Turn iterations off if that makes
> sense
> > >>>> for your application.  If some rules don't have value in
> forward-
> > >>>> chaining, set the rule property's
> spin:rulePropertyMaxIterationCount
> > >>>> to the maximum needed.
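> >
> > >>>> As a sketch in Turtle (assuming the standard SPIN namespace), to
> > >>>> run rules attached via spin:rule only once:
> >
> > >>>> @prefix spin: <http://spinrdf.org/spin#> .
> > >>>> spin:rule spin:rulePropertyMaxIterationCount 1 .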
> >
> > >>>> Many things to try here and I hope this isn't mistaken as being
> the
> > >>>> last word on this topic!
> >
> > >>>> -- Scott
> >
> > >>>> On May 28, 12:36 pm, Christian Fuerber <[email protected]> wrote:
> >
> > >>>>> Hi Holger,
> >
> > >>>>> thank you for the information. I finally set up a TDB
> triplestore with
> > >>>>> over 53 million triples and it shows up perfectly in TBC. But now
> > >>>>> I'm having performance problems when running SPARQL queries over
> > >>>>> the triplestore in TBC's SPARQL view. SPIN constraints also seem
> > >>>>> to run forever. I'm using the 64-bit version of TBC 3.3.1 SE and
> > >>>>> set the Java heap space limit to 6144m.
> >
> > >>>>> Are there any configurations that can speed up TBC's queries and
> > >>>>> SPIN constraints besides query optimization and Java heap space
> > >>>>> settings?
> >
> > >>>>> Thanks,
> > >>>>> Christian
> >
> > >>>>> On 24 Mai, 03:51, Holger Knublauch <[email protected]>
> wrote:
> >
> > >>>>>> Hi Christian,
> >
> > >>>>>> TopBraid's Sesame 2 integration is not as optimized as our
> other database interfaces. In practice this means that TBC cannot use
> any native optimizations that the database server (in your case
> Virtuoso) may provide. Instead, it will break down even the most
> complex SPARQL queries into small findSPO queries, which may lead to
> significant performance problems. Maybe that's why constraint checking
> apparently did not work for you when you ran SPIN constraints on
> Virtuoso. With smaller databases this impact may not have been
> sufficiently severe to notice. But I am glad that you have been able to
> confirm that Virtuoso is working well with TBC in principle.
> >
> > >>>>>> Over the weekend we also had some enlightening examples of
> SPARQL queries that were not optimized: we had a query with three
> OPTIONAL clauses, leading to a large number of potential combinations.
> Replacing those with other SPARQL patterns led to speed improvements of
> many orders of magnitude. Just saying this in case you have some
> "dangerous" queries in your constraints. I assume you have executed the
> queries individually, e.g. from the SPARQL view, to test their
> performance before putting them into your SPIN constraint library.
> >
> > >>>>>> Finally, I discovered a performance issue after running the SPIN
> constraints from the Problems View: if hundreds or thousands of
> violations were found, then just updating those into the Problems view
> may freeze the system. I have just fixed this for 3.3.2 (and 3.4).
> >
> > >>>>>> Regards,
> > >>>>>> Holger
> >
> > >>>>>> On May 21, 2010, at 10:25 PM, Christian Fuerber wrote:
> >
> > >>>>>>> Hi Holger,
> >
> > >>>>>>> Fortunately I can apply SPIN constraints now on the small graph
> > >>>>>>> (245 triples) in Virtuoso, although I still do not know what
> > >>>>>>> the problem was. Maybe it works now because I reinstalled Java.
> > >>>>>>> I will also send you the error log, in case you are interested.
> >
> > >>>>>>> Thanks a lot,
> >
> > >>>>>>> Christian
> >
> > >>>>>>> On 21 Mai, 00:20, Holger Knublauch <[email protected]>
> wrote:
> > >>>>>>>> On May 21, 2010, at 1:08 AM, Christian Fuerber wrote:
> >
> > >>>>>>>>> Hi Holger,
> >
> > >>>>>>>>> thank you for the quick response. Yes, I could successfully
> connect to
> > >>>>>>>>> a graph in virtuoso that has 245 triples. But the SPIN
> constraint
> > >>>>>>>>> checks are not working on the graph's data. I receive an
> error "Could
> > >>>>>>>>> not run checker" when executing "Refresh and show problems
> > >>>>>>>>> (constraints)". SPARQL in TBC is also not working. I just
> can see the
> > >>>>>>>>> classes and instances in the editor.
> >
> > >>>>>>>> Are there any more details available, e.g. the Error Log?
> >
> > >>>>>>>> And yes, TDB will almost certainly be faster for SPARQL,
> because it will "live" in the same JVM, so no communication overhead is
> needed. Furthermore, TDB is better optimized to work with the ARQ
> SPARQL engine.
> >
> > >>>>>>>> Thanks
> > >>>>>>>> Holger
> >
> > >>>>>>>> --
> > >>>>>>>> You received this message because you are subscribed to the
> Google
> > >>>>>>>> Group "TopBraid Suite Users", the topics of which include
> TopBraid Composer,
> > >>>>>>>> TopBraid Live, TopBraid Ensemble, SPARQLMotion and SPIN.
> > >>>>>>>> To post to this group, send email to
> > >>>>>>>> [email protected]
> > >>>>>>>> To unsubscribe from this group, send email to
> > >>>>>>>> [email protected]
> > >>>>>>>> For more options, visit this group at
> > >>>>>>>> http://groups.google.com/group/topbraid-users?hl=en
