Re: Regex performance in relation to the BGP

Milorad Tosic Tue, 07 Jan 2014 03:31:22 -0800

Andy,

Thanks for the efforts to write this great explanation! IMHO, it is at least as 
valuable as a published paper.


Regards,
Milorad




>________________________________
> From: Andy Seaborne <a...@apache.org>
>To: users@jena.apache.org 
>Sent: Tuesday, January 7, 2014 10:43 AM
>Subject: Re: Regex performance in relation to the BGP
> 
>
>On 06/01/14 12:35, Saud Aljaloud wrote:
>> Hi all,
>>
>> Happy new year Jena guys, I am working on the performance of regex in 
>> relation to triple stores. Since Jena is a good choice to inspect, I tried 
>> to dig in the code and debug how sparql works out regex.
>>
>> I'm using the latest source code 
>> https://svn.apache.org/repos/asf/jena/trunk/jena-arq/
>>
>> I tried to debug the class com.hp.hpl.jena.sparql.expr.RegexJava and check 
>> how many visits to the method match() by printing a counter inside it,
>> I found out that the "order" of the query pattern within sparql can affect 
>> how many visits.
>>
>> For example, testing this turtle "example.ttl":
>> <http://www.example.org/s1><http://www.example.org/p1> "triple1@en".
>> <http://www.example.org/s1><http://www.example.org/p1> "triple2@en".
>> <http://www.example.org/s1><http://www.example.org/p2> "triple3@en".
>> <http://www.example.org/s2><http://www.example.org/p2> "triple4@en".
>> <http://www.example.org/s2><http://www.example.org/p3> "triple5@en".
>> <http://www.example.org/s2><http://www.example.org/p3> "triple6@en".
>>
>>
>> With the usual model building:
>> Model model = ModelFactory.createDefaultModel();
>> model.read("/Users/user/example.ttl");
>> Query query = QueryFactory.create(queryString) ;
>> QueryExecution qexec = QueryExecutionFactory.create(query, model) ;
>> ResultSet results = qexec.execSelect() ;
>> ResultSetFormatter.out(System.out, results);
>>
>>
>>
>> And the sparql query:
>> String queryString = "select * " +
>>         "where {" +
>>         "<http://www.example.org/s1> ?p ?o ." +
>>         "?s ?p ?o ." +
>>         "filter regex (?o, \"triple\")" +
>>         "}";
>>
>> yields 3 visits, which is perfect,
>>
>> However, the query:
>> String queryString = "select * " +
>>         "where {" +
>>         "?s ?p ?o ." +
>>         "<http://www.example.org/s1> ?p ?o ." +
>>         "filter regex (?o, \"triple\")" +
>>         "}";
>>
>> Although both return the same result set, yet here, it prints 6 visits,
>>
>> Even narrowing the query to something like the below prints 6 visits:
>> String queryString = "select * " +
>>         "where {" +
>>         "?s ?p ?o ." +
>>         "<http://www.example.org/s1> <http://www.example.org/p1> ?o ." +
>>         "<http://www.example.org/s1> ?p ?o ." +
>>         "filter regex (?o, \"triple\")" +
>>         "}";
>>
>>
>>
>> Looks like the filter regex is only looking for the first query pattern 
>> regardless the rest of the BGP.
>> I'm not sure if this is the case for all other queries other than regex!
>>
>>
>> Ideally, I think, sparql should reduce the set of tested triples, regardless 
>> of the order of the query patterns, and since regex by its nature
>> slows down the performance, this will even add more overhead on the overall 
>> performance.
>>
>> Is this a bug, or am I missing something here?
>>
>> Cheers,
>> Saud.
>
>Hi there,
>
>When an optimizer looks at that query, it is trying to find the most 
>selective patterns to start with.  When a FILTER is involved, 
>selectivity estimation is hard and there are two alternatives of whether 
>to reorder the patterns, then place the filter, or place the filter then 
>reorder the patterns.
>
>The pattern here has the feature that one triple pattern is a subset of 
>the other. The ARQ optimizer does not look for that case - there's 
>nothing stopping that being done, it just isn't currently.
>
>In the case of in-memory, we also don't want the optimizer to spend too 
>long thinking.  Queries can be short and specific, used at high 
>frequency. There is a real danger the optimizer will take longer than 
>executing the query if the optimizer tries to enumerate and weight all 
>possibilities.  (Virtuoso wrote at length about this for the early BSBM 
>runs and they changed their optimizer from that experience.)  The side 
>and negative effect is that how the query is written can affect performance.
>
>The Jena optimizer has two stages, and maybe this should be changed to 
>something richer. There is a stage before the query starts of high-level 
>optimization, which is rewriting the algebra expressions to "better" 
>ones.  Any given storage layer can choose the most appropriate 
>transformations to apply (including bypassing this stage).  The storage 
>execution can also choose to make storage specific changes, referred to 
>as low-level optimizations.  Currently, that means reordering the 
>triples in a BGP because it needs the storage layer to keep the statistics.
>
>If filter placement is done early, in high-level ioptimization, it can 
>break up BGPs.
>
>You optimized (high level) query is:
>
>(sequence
>   (filter (regex ?o "triple")
>     (bgp (triple <http://www.example.org/s1> ?p ?o)))
>   (bgp (triple ?s ?p ?o)))
>
>[from arq.qparse --print=opt]
>
>and there is no single BGP to reorder any more.  In the other order you get:
>
>(sequence
>   (filter (regex ?o "triple")
>     (bgp (triple ?s ?p ?o)))
>   (bgp (triple <http://www.example.org/s1> ?p ?o)))
>
>Whether starting with the filter is a good idea depends on the filter. 
>Determining the true selectivity of the filter is hard and the in-memory 
>system does not keep value frequency statistics so it does not know 
>whether a filter is highly selective of not.
>
>What would be better is a more integrated approach of alegbra reordering 
>and storage-level BGP reordering.  At least reordering by how ground the 
>triple patterns are would be good (usually!) before doing filter 
>placement.
>
>Reordering by simply how grounded a triple pattern is, rather than using 
>data-specific statistics, which are quite some work to keep, is often 
>just as good - slightly irritating to anyone whose written a statistics 
>based reordering optimizer.
>
>The in-memory order of the unbroken BGP is by groudedness:
>
>(bgp
>   (triple ?s ?p ?o)
>   (triple <http://www.example.org/s1> ?p ?o)
>)
>
>and the second pattern will be first (its more grounded).  The filter 
>placement got in the way of that.
>
>TDB doesn't keep value-frequency stats either - it only uses triple 
>pattern counts. You need value frequency to know whether
>
>FILTER ( ?o < 56 ) is highly specific or not.  It looks a lot like
>FILTER ( ?o >= 56 ) until you know the distribution of ?o.
>
>TDB turns off the filter placement high-level optimization so that it 
>sees the the whole BGP, and then it itself places the filter.  A 
>downside of that is that filter placement across non-BGPs is lost - the 
>codebase for the next release has a much improved filter placement 
>optimization step but it does not yet have the controls to do some 
>placement but delay filter/BGP placement.
>
>It has been reported that TDB, running in-memory, can be 50x faster than 
>the in-memory store. I suspect this is due to filter placement effects 
>as it does better than in-memory in combining filters and BGPs.
>
>You can try with..
>
>Model model = TDBFactory.createDataset.getDefaultModel() ;
>model.read ...
>
>About regxps:
>
>Regular expression matching can be costly but it's usually because the 
>regular expression itself has choices.  Your example doesn't.  As it's a 
>fixed string pattern, it is compiled once and used repeatedly.  Matching 
>a fixed string isn't zero-cost but I would not have through it was very 
>large.  If it had lots of meta-characters and disjunctions then the 
>execution can be costly.
>
>Regex is done with java.util.regex - there is alternative which is 
>pedantically more accurate based on some Xerces code (Java does not have 
>one of the flag equivalent of full XSD definition - "x" -which si to do 
>with whitespace mangling for readability).
>
>(PS languages are "triple1"@en, not "triple1@en" unless you have 
>rdf:PlainLiterals and technically they can't don't appear in RDF data).
>
>SPARQL does have CONTAINS:
>
>SELECT * {
>
>   <http://www.example.org/s1> ?p ?o .
>   ?s ?p ?o .
>   FILTER CONTAINS(?o, "triple")
>}
>
>So there current optimization framework is a pragmatic balance.  It can 
>be made better.  You could try out different strategies within the Jena 
>framework.
>
>You can install your own optimizer.  The default one is just a 
>particular set of rewrites and the app can change that by installing its 
>own code.  See Optimize.rewrite that implements Rewrite.rewrite.  The 
>optimization step can be set globally, per dataset or per query execution.
>
>    Andy
>
>
>
>
>
>

Re: Regex performance in relation to the BGP

Reply via email to