Andy, Thanks for the efforts to write this great explanation! IMHO, it is at least as valuable as a published paper.
Regards, Milorad >________________________________ > From: Andy Seaborne <a...@apache.org> >To: users@jena.apache.org >Sent: Tuesday, January 7, 2014 10:43 AM >Subject: Re: Regex performance in relation to the BGP > > >On 06/01/14 12:35, Saud Aljaloud wrote: >> Hi all, >> >> Happy new year Jena guys, I am working on the performance of regex in >> relation to triple stores. Since Jena is a good choice to inspect, I tried >> to dig in the code and debug how sparql works out regex. >> >> I'm using the latest source code >> https://svn.apache.org/repos/asf/jena/trunk/jena-arq/ >> >> I tried to debug the class com.hp.hpl.jena.sparql.expr.RegexJava and check >> how many visits to the method match() by printing a counter inside it, >> I found out that the "order" of the query pattern within sparql can affect >> how many visits. >> >> For example, testing this turtle "example.ttl": >> <http://www.example.org/s1><http://www.example.org/p1> "triple1@en". >> <http://www.example.org/s1><http://www.example.org/p1> "triple2@en". >> <http://www.example.org/s1><http://www.example.org/p2> "triple3@en". >> <http://www.example.org/s2><http://www.example.org/p2> "triple4@en". >> <http://www.example.org/s2><http://www.example.org/p3> "triple5@en". >> <http://www.example.org/s2><http://www.example.org/p3> "triple6@en". >> >> >> With the usual model building: >> Model model = ModelFactory.createDefaultModel(); >> model.read("/Users/user/example.ttl"); >> Query query = QueryFactory.create(queryString) ; >> QueryExecution qexec = QueryExecutionFactory.create(query, model) ; >> ResultSet results = qexec.execSelect() ; >> ResultSetFormatter.out(System.out, results); >> >> >> >> And the sparql query: >> String queryString = "select * " + >> "where {" + >> "<http://www.example.org/s1> ?p ?o ." + >> "?s ?p ?o ." + >> "filter regex (?o, \"triple\")" + >> "}"; >> >> yields 3 visits, which is perfect, >> >> However, the query: >> String queryString = "select * " + >> "where {" + >> "?s ?p ?o ." + >> "<http://www.example.org/s1> ?p ?o ." + >> "filter regex (?o, \"triple\")" + >> "}"; >> >> Although both return the same result set, yet here, it prints 6 visits, >> >> Even narrowing the query to something like the below prints 6 visits: >> String queryString = "select * " + >> "where {" + >> "?s ?p ?o ." + >> "<http://www.example.org/s1> <http://www.example.org/p1> ?o ." + >> "<http://www.example.org/s1> ?p ?o ." + >> "filter regex (?o, \"triple\")" + >> "}"; >> >> >> >> Looks like the filter regex is only looking for the first query pattern >> regardless the rest of the BGP. >> I'm not sure if this is the case for all other queries other than regex! >> >> >> Ideally, I think, sparql should reduce the set of tested triples, regardless >> of the order of the query patterns, and since regex by its nature >> slows down the performance, this will even add more overhead on the overall >> performance. >> >> Is this a bug, or am I missing something here? >> >> Cheers, >> Saud. > >Hi there, > >When an optimizer looks at that query, it is trying to find the most >selective patterns to start with. When a FILTER is involved, >selectivity estimation is hard and there are two alternatives of whether >to reorder the patterns, then place the filter, or place the filter then >reorder the patterns. > >The pattern here has the feature that one triple pattern is a subset of >the other. The ARQ optimizer does not look for that case - there's >nothing stopping that being done, it just isn't currently. > >In the case of in-memory, we also don't want the optimizer to spend too >long thinking. Queries can be short and specific, used at high >frequency. There is a real danger the optimizer will take longer than >executing the query if the optimizer tries to enumerate and weight all >possibilities. (Virtuoso wrote at length about this for the early BSBM >runs and they changed their optimizer from that experience.) The side >and negative effect is that how the query is written can affect performance. > >The Jena optimizer has two stages, and maybe this should be changed to >something richer. There is a stage before the query starts of high-level >optimization, which is rewriting the algebra expressions to "better" >ones. Any given storage layer can choose the most appropriate >transformations to apply (including bypassing this stage). The storage >execution can also choose to make storage specific changes, referred to >as low-level optimizations. Currently, that means reordering the >triples in a BGP because it needs the storage layer to keep the statistics. > >If filter placement is done early, in high-level ioptimization, it can >break up BGPs. > >You optimized (high level) query is: > >(sequence > (filter (regex ?o "triple") > (bgp (triple <http://www.example.org/s1> ?p ?o))) > (bgp (triple ?s ?p ?o))) > >[from arq.qparse --print=opt] > >and there is no single BGP to reorder any more. In the other order you get: > >(sequence > (filter (regex ?o "triple") > (bgp (triple ?s ?p ?o))) > (bgp (triple <http://www.example.org/s1> ?p ?o))) > >Whether starting with the filter is a good idea depends on the filter. >Determining the true selectivity of the filter is hard and the in-memory >system does not keep value frequency statistics so it does not know >whether a filter is highly selective of not. > >What would be better is a more integrated approach of alegbra reordering >and storage-level BGP reordering. At least reordering by how ground the >triple patterns are would be good (usually!) before doing filter >placement. > >Reordering by simply how grounded a triple pattern is, rather than using >data-specific statistics, which are quite some work to keep, is often >just as good - slightly irritating to anyone whose written a statistics >based reordering optimizer. > >The in-memory order of the unbroken BGP is by groudedness: > >(bgp > (triple ?s ?p ?o) > (triple <http://www.example.org/s1> ?p ?o) >) > >and the second pattern will be first (its more grounded). The filter >placement got in the way of that. > >TDB doesn't keep value-frequency stats either - it only uses triple >pattern counts. You need value frequency to know whether > >FILTER ( ?o < 56 ) is highly specific or not. It looks a lot like >FILTER ( ?o >= 56 ) until you know the distribution of ?o. > >TDB turns off the filter placement high-level optimization so that it >sees the the whole BGP, and then it itself places the filter. A >downside of that is that filter placement across non-BGPs is lost - the >codebase for the next release has a much improved filter placement >optimization step but it does not yet have the controls to do some >placement but delay filter/BGP placement. > >It has been reported that TDB, running in-memory, can be 50x faster than >the in-memory store. I suspect this is due to filter placement effects >as it does better than in-memory in combining filters and BGPs. > >You can try with.. > >Model model = TDBFactory.createDataset.getDefaultModel() ; >model.read ... > >About regxps: > >Regular expression matching can be costly but it's usually because the >regular expression itself has choices. Your example doesn't. As it's a >fixed string pattern, it is compiled once and used repeatedly. Matching >a fixed string isn't zero-cost but I would not have through it was very >large. If it had lots of meta-characters and disjunctions then the >execution can be costly. > >Regex is done with java.util.regex - there is alternative which is >pedantically more accurate based on some Xerces code (Java does not have >one of the flag equivalent of full XSD definition - "x" -which si to do >with whitespace mangling for readability). > >(PS languages are "triple1"@en, not "triple1@en" unless you have >rdf:PlainLiterals and technically they can't don't appear in RDF data). > >SPARQL does have CONTAINS: > >SELECT * { > > <http://www.example.org/s1> ?p ?o . > ?s ?p ?o . > FILTER CONTAINS(?o, "triple") >} > >So there current optimization framework is a pragmatic balance. It can >be made better. You could try out different strategies within the Jena >framework. > >You can install your own optimizer. The default one is just a >particular set of rewrites and the app can change that by installing its >own code. See Optimize.rewrite that implements Rewrite.rewrite. The >optimization step can be set globally, per dataset or per query execution. > > Andy > > > > > >