Hi Elli!

It seems that at least part of your problem is having duplicates in the LARQ index. Have you tried creating the Lucene index using the larqbuilder command line tool, instead of removing the index and just letting Fuseki rebuild it when it starts? See the end of my tutorial [1] for a recipe.

As I understand it, unless you give larqbuilder the --allow-duplicates option, it will try to avoid duplicates in the index, though building the index will take longer.

I've also noticed that it usually makes sense to place the pf:textMatch pattern first in the query; otherwise it will be executed many times and slow down the whole query, sometimes by a lot.

Hope this helps,
-Osma

[1] http://code.google.com/p/onki-light/wiki/InstallFusekiLARQ


On Tue, 23 Oct 2012, Elli Schwarz wrote:

Hello,


I am using Fuseki with LARQ (thanks to Osma's recent instructions - thanks
Osma!), built from Jena revision 1399877 (this past Friday morning's version
of the trunk) after adding the LARQ dependency. I'm noticing the following
anomaly when querying the data:

First I insert the following triples:
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
insert data {  graph <urn:test:foo> {
     <urn:test:s1> <urn:test:p1> "foo"^^xsd:string .
     <urn:test:s1> <urn:test:p2> "foo"^^xsd:string .
     <urn:test:s2> <urn:test:p3> "foo"^^xsd:string .
} }

Then I stop Fuseki, delete my index directory, and restart Fuseki. (As an 
aside, I'd be very interested in a fix for this so I don't have to restart 
Fuseki to rebuild the index - I'm watching JENA-164 and hoping someone will be 
able to work on it soon!) Once Fuseki is back up, I run the following query (I
have the default graph configured as the union of the named graphs):
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
select * where {
     <urn:test:s1> ?p ?lit .
   ?lit pf:textMatch "foo" . }

and I get 2 results as I expect:

--------------------------------------------------------------------
| p             | lit                                              |
====================================================================
| <urn:test:p1> | "foo"^^<http://www.w3.org/2001/XMLSchema#string> |
| <urn:test:p2> | "foo"^^<http://www.w3.org/2001/XMLSchema#string> |
--------------------------------------------------------------------

However, when I flip the order of my query like this:

PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
select * where {
   ?lit pf:textMatch "foo" .
     <urn:test:s1> ?p ?lit . }

I get 6 results, instead of the two I expect:

--------------------------------------------------------------------
| lit                                              | p             |
====================================================================
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |
--------------------------------------------------------------------

My guess as to what happens is that in the second query, the query executor
first evaluates the first line (the ?lit pf:textMatch "foo"), which returns 3
results for foo, since there are 3 literals for "foo". The next line of the
query then has three bindings for ?lit, so it produces the 6 results above (2
for each "foo" literal, since there are 2 properties for <urn:test:s1>). I know
that I can avoid this by using a SELECT DISTINCT, but I still think the query
shouldn't produce different results based on switching the order.
Additionally, if I put this in a CONSTRUCT query, I can't use DISTINCT to
eliminate the duplicate results (unless I use a SELECT DISTINCT subquery, which
I'd rather avoid).
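
The arithmetic in that guess (3 index hits times 2 matching triples = 6 rows) can be checked with a small toy model. This is only a sketch in plain Python, not how ARQ or LARQ is implemented; the names index_hits and join_with_subject are made up for illustration, and it assumes the index stores one entry per indexed triple.

```python
# Toy model of the behaviour described above; not ARQ/LARQ internals.
triples = [
    ("urn:test:s1", "urn:test:p1", "foo"),
    ("urn:test:s1", "urn:test:p2", "foo"),
    ("urn:test:s2", "urn:test:p3", "foo"),
]

# Assumption: the index holds one entry per indexed triple, so the
# literal "foo" comes back once for each of the three triples.
index_hits = [o for (_s, _p, o) in triples if o == "foo"]


def join_with_subject(hits, subject):
    """For each ?lit binding from the index, match <subject> ?p ?lit."""
    rows = []
    for lit in hits:
        for (s, p, o) in triples:
            if s == subject and o == lit:
                rows.append((lit, p))
    return rows


# Text match first, index contains duplicates: 3 bindings x 2 triples.
rows_with_duplicates = join_with_subject(index_hits, "urn:test:s1")

# Same join, but the index returns each literal only once.
rows_deduplicated_index = join_with_subject(sorted(set(index_hits)), "urn:test:s1")

# Removing duplicate rows afterwards, which is what SELECT DISTINCT does.
rows_distinct = set(rows_with_duplicates)

print(len(rows_with_duplicates), len(rows_deduplicated_index), len(rows_distinct))
```

Under these assumptions, either removing duplicates from the index or applying DISTINCT to the rows brings the count back to the expected two.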

Another point I've noticed is that in my other (much more complex) queries, 
against a much larger dataset (~1.5 million triples), if I put the pf:textMatch 
line anywhere but in the very beginning of the query, the query takes a VERY 
long time to execute. If I put it as the first line in the query, the query 
runs quickly. My guess for this is that the query is executed in order, and it 
takes much more work for the query executor to run the other parts of my query 
which contain many results, and then have to go back and essentially filter out 
those results where the literal doesn't match the pf:textMatch. I can always 
place the pf:textMatch line first, but then I'm back to the problem mentioned 
above where I get back too many duplicate results.
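
The ordering effect can also be sketched with a toy cost model. The code below is plain Python with made-up names (CountingIndex, text_match_first, text_match_last) and assumes a strictly in-order evaluation in which a late text match is re-run for every intermediate row; the real ARQ/LARQ execution strategy may well differ.

```python
# Toy cost model only; it counts index probes under an in-order evaluation.
triples = [
    ("urn:test:s%d" % i, "urn:test:label", "foo" if i % 100 == 0 else "bar%d" % i)
    for i in range(1000)
]


class CountingIndex:
    """Stand-in for a text index that records how often it is searched."""

    def __init__(self, data):
        self.literals = {o for (_s, _p, o) in data}
        self.probes = 0

    def search(self, term):
        self.probes += 1
        return {term} if term in self.literals else set()


def text_match_first(index):
    # Probe the index once, then keep only triples whose literal matched.
    hits = index.search("foo")
    return [t for t in triples if t[2] in hits]


def text_match_last(index):
    # Enumerate every triple first, then probe the index once per row.
    return [t for t in triples if t[2] in index.search("foo")]


first_index = CountingIndex(triples)
last_index = CountingIndex(triples)
rows_first = text_match_first(first_index)
rows_last = text_match_last(last_index)

print(len(rows_first), first_index.probes)
print(len(rows_last), last_index.probes)
```

Both orderings return the same rows in this model; only the number of index probes changes, from one probe to one per intermediate row.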

Thank you very much for your help!
-Elli

--
Osma Suominen | osma.suomi...@aalto.fi | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing Research 
Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076 Aalto, 
Finland
