Re: Fueski with Larq - query anomaly

2012-11-16 Thread Paolo Castagna

On 24/10/12 12:11, Osma Suominen wrote:

Hi Elli!

It seems that at least part of your problem is having duplicates in the
LARQ index. Have you tried creating the Lucene index using the
larqbuilder command line tool, instead of removing the index and just
letting Fuseki rebuild it when it starts? See the end of my tutorial [1]
for a recipe.

As I understand it, unless you give larqbuilder the --allow-duplicates
option, it will try to avoid duplicates in the index. Though the index
building will take longer.


Exactly.

Duplicate removal slows down indexing. If you want to index a large
dataset, you may want to disable it to go faster.


Maybe that option should be renamed. Proposal?

Paolo



I've also noticed that it usually makes sense to place the pf:textMatch
pattern first in the query, otherwise it will be executed many times and
slow down the whole query, sometimes by a lot.

Hope this helps,
-Osma

[1] http://code.google.com/p/onki-light/wiki/InstallFusekiLARQ


On Tue, 23 Oct 2012, Elli Schwarz wrote:


Hello,


I am using Fuseki with Larq (thanks to Osma's recent instructions -
thanks Osma!)  where I recompiled Jena (after adding the Larq
dependency) to Jena revision 1399877 (this past Friday morning's
version of the trunk). I'm noticing the following anomaly when
querying the data:

First I insert the following triples:
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
insert data {  graph <urn:test:foo> {
 <urn:test:s1> <urn:test:p1> "foo"^^xsd:string .
 <urn:test:s1> <urn:test:p2> "foo"^^xsd:string .
 <urn:test:s2> <urn:test:p3> "foo"^^xsd:string .
} }

Then I stop Fuseki, delete my index directory, and restart Fuseki. (As
an aside, I'd be very interested in a fix for this so I don't have to
restart Fuseki to rebuild the index - I'm watching JENA-164 and hoping
someone will be able to work on it soon!) Once Fuseki is back up, I
run the following query (I have default graph set as the union of
named graphs by default):
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
select * where {
 <urn:test:s1> ?p ?lit .
 ?lit pf:textMatch "foo" . }

and I get 2 results as I expect:


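--------------------------------------------------------------------
| p             | lit                                              |
====================================================================
| <urn:test:p1> | "foo"^^<http://www.w3.org/2001/XMLSchema#string> |
| <urn:test:p2> | "foo"^^<http://www.w3.org/2001/XMLSchema#string> |
--------------------------------------------------------------------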

However, when I flip the order of my query like this:

PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
select * where {
 ?lit pf:textMatch "foo" .
 <urn:test:s1> ?p ?lit .
}

I get 6 results, instead of the two I expect:

--------------------------------------------------------------------
| lit                                              | p             |
====================================================================
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |
--------------------------------------------------------------------

My guess as to what happens is that in the second query, first the query
executor executes the first line (the ?lit pf:textMatch "foo") and
this returns 3 results for "foo", since there are 3 literals for "foo".
Then, the next line of the query has three bindings to ?lit, so it
produces the 6 results above (2 for each "foo" literal since there are
2 properties for urn:test:s1). I know that I can avoid this by using
a SELECT DISTINCT, but I still think the query shouldn't produce
different results based on switching the order. Additionally, if I put
this in a CONSTRUCT query, I can't use DISTINCT to eliminate the
duplicate results (unless I use a SELECT DISTINCT subquery which I'd
rather avoid).
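
The guess above can be reproduced with a small toy model. This is only a sketch in plain Python, not LARQ or ARQ code: the lists standing in for the graph and the text index, and the names text_match, text_first and triple_first, are all hypothetical. It assumes that an index built without duplicate avoidance holds one entry per triple, and that a text match on an already bound ?lit acts as a yes/no check.

```python
# Toy model only: plain Python lists stand in for the graph and the text index.
triples = [
    ("urn:test:s1", "urn:test:p1", "foo"),
    ("urn:test:s1", "urn:test:p2", "foo"),
    ("urn:test:s2", "urn:test:p3", "foo"),
]

# Assumption: without duplicate avoidance the index gets one entry per triple.
index_with_dupes = [o for (_, _, o) in triples]
# With duplicate avoidance each distinct literal appears once.
index_deduped = list(dict.fromkeys(index_with_dupes))


def text_match(index, term):
    """Return every index entry equal to term, duplicates included."""
    return [lit for lit in index if lit == term]


def text_first(index, term, subject):
    """Text match first: every hit binds ?lit, then joins with the triple pattern."""
    return [(lit, p)
            for lit in text_match(index, term)
            for (s, p, o) in triples
            if s == subject and o == lit]


def triple_first(index, term, subject):
    """Triple pattern first: ?lit is already bound, so the match is a yes/no check."""
    return [(p, o)
            for (s, p, o) in triples
            if s == subject and o in text_match(index, term)]


print(len(triple_first(index_with_dupes, "foo", "urn:test:s1")))
print(len(text_first(index_with_dupes, "foo", "urn:test:s1")))
print(len(text_first(index_deduped, "foo", "urn:test:s1")))
```

In this model the row count depends on pattern order only while the index contains duplicate entries; with the deduplicated index both orders give the same rows.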

Another point I've noticed is that in my other (much more complex)
queries, against a much larger dataset (~1.5 million triples), if I
put the pf:textMatch line anywhere but in the very beginning of the
query, the query takes a VERY long time to execute. If I put it as the
first line in the query, the query runs quickly. My guess for this is
that the query is executed in order, and it takes much more work for
the query executor to run the other parts of my query which contain
many results, and then have to go back and essentially filter out
those results where the literal doesn't match the pf:textMatch. I can
always place the pf:textMatch line first, but then I'm back to the
problem mentioned above where I get back too many duplicate results.
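
The cost difference described above can be illustrated with a rough sketch. It is a toy model in plain Python, not how ARQ actually plans or executes queries: CountingIndex, match_last and match_first are hypothetical names, and the assumption is that a text match placed after another pattern triggers one index search per incoming binding.

```python
class CountingIndex:
    """Toy text index that counts how many times it is searched."""

    def __init__(self, literals):
        self.literals = set(literals)
        self.searches = 0

    def search(self, term):
        self.searches += 1
        # A literal matches when the term is one of its whitespace-separated tokens.
        return {lit for lit in self.literals if term in lit.split()}


def make_triples(n):
    """Build n triples, each with its own label literal."""
    return [(f"urn:test:s{i}", "urn:test:label", f"label {i}") for i in range(n)]


def match_last(triples, index, term):
    """Text match after the triple pattern: one index search per binding (assumed)."""
    return [(s, o) for (s, _, o) in triples if o in index.search(term)]


def match_first(triples, index, term):
    """Text match first: a single index search, then a join against the triples."""
    hits = index.search(term)
    return [(s, o) for (s, _, o) in triples if o in hits]


triples = make_triples(1000)
literals = [o for (_, _, o) in triples]

late = CountingIndex(literals)
early = CountingIndex(literals)
rows_late = match_last(triples, late, "42")
rows_early = match_first(triples, early, "42")
print(late.searches, early.searches)
```

Both orders return the same rows in this model; only the number of index searches differs, which is consistent with the slowdown reported for the larger dataset.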

Thank you very much for your help!
-Elli






Re: Fueski with Larq - query anomaly

2012-11-16 Thread Paolo Castagna

Hi Osma, hi Elli

On 02/11/12 10:34, Osma Suominen wrote:

Hi Elli!

[apparently your reply didn't come through the mailing list, but this
one should]

31.10.2012 23:11, Elli Schwarz kirjoitti:

Thank you for the tip. Yes, if I generate the index using the
larqbuilder command, I don't get the duplicates in the query, regardless
of the placement of the pf:textMatch line. (As an aside, why does the
default behavior of creating the index allow duplicates, but the default
of the larqbuilder command does not?)


Good to hear that eliminating duplicates works for you. I have no idea
why the defaults are as they are.


LARQ index 'text' -- RDF nodes, see in IndexBuilderNode.java:

public void index(Node node, String indexStr)
{
    try {
        // Drop any existing document for this node first, so a node or
        // literal used in many triples is not indexed more than once.
        if ( avoidDuplicates() ) unindex(node, indexStr) ;
        Document doc = new Document() ;
        LARQ.store(doc, node) ;
        LARQ.index(doc, node, indexStr) ;
        getIndexWriter().addDocument(doc) ;
    } catch (IOException ex)
    { throw new ARQLuceneException("index", ex) ; }
}

avoidDuplicates() by default returns 'true' and by default we want to 
avoid duplicates and make the Lucene index smaller.


if ( avoidDuplicates() ) unindex(node, indexStr); is 'ugly' and 
inefficient, but it is done to avoid having useless documents in the 
Lucene index, as you might have exactly the same RDF node/literal used 
in many triples.
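
The delete-before-add idea in the snippet above can be modelled in a few lines. This is a minimal sketch in plain Python, not the LARQ or Lucene API: ToyIndex and its methods are hypothetical, and the linear scan in unindex only stands in for whatever work the real delete has to do.

```python
class ToyIndex:
    """Toy stand-in for a text index holding (node, text) documents."""

    def __init__(self, avoid_duplicates=True):
        self.avoid_duplicates = avoid_duplicates
        self.docs = []

    def unindex(self, node, index_str):
        # Remove every existing document for this node and text.
        self.docs = [d for d in self.docs if d != (node, index_str)]

    def index(self, node, index_str):
        # Delete before add: this extra step is what costs time during a build.
        if self.avoid_duplicates:
            self.unindex(node, index_str)
        self.docs.append((node, index_str))


deduped = ToyIndex(avoid_duplicates=True)
raw = ToyIndex(avoid_duplicates=False)
for _ in range(3):  # the same literal is used in three triples
    deduped.index("foo", "foo")
    raw.index("foo", "foo")
print(len(deduped.docs), len(raw.docs))
```

The trade-off is the one described in the thread: the deduplicated index stays small, at the price of an extra delete for every document added.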


I am open to suggestions for making this better or faster.


However, switching the order of where I place the pf:textMatch line
(while it may slow down the query), should not produce different
results, even if there are duplicates in the index. This would appear to
be a bug in how Larq applies the results of the index lookup to the
query.


Elli, could you provide an example with some data and your query?


I'm not sure whether getting or not getting duplicates in specific
situations can be considered a bug. But yes, the implementation of LARQ
seems to be rather simplistic. It might help if the raw index results
were filtered to weed out duplicates before applying them to the query.


How could we do this?


Then the choice whether to try to avoid duplicates during indexing would
only be an optimization issue.
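
One possible shape for such a filter is sketched below. It is only an illustration in plain Python, not a patch against LARQ: distinct_hits is a hypothetical name, and the assumption is that raw index hits arrive as (node, score) pairs in which the same node may appear several times.

```python
def distinct_hits(hits):
    """Collapse repeated nodes in raw index hits.

    Keeps first-seen order and the best score seen for each node.
    """
    best = {}
    for node, score in hits:
        if node not in best or score > best[node]:
            best[node] = score
    return list(best.items())


raw_hits = [("foo", 0.9), ("foo", 0.9), ("bar", 0.5), ("foo", 1.0)]
print(distinct_hits(raw_hits))
```

With a step like this in front of the join, duplicates in the index would affect only index size and build time, not the number of query results.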

BTW I'm not (so far) a LARQ developer, just a fellow user..


But you could help out with LARQ (if you are using it!).
Patches are always welcome, especially from fellow users! ;-)

By the way, many thanks for the documentation on how to use LARQ with 
Fuseki. Very useful (and it will save me time... I can just point people 
to your page from now on).


Paolo



-Osma




Re: Fueski with Larq - query anomaly

2012-11-16 Thread Paolo Castagna

On 16/11/12 22:20, Paolo Castagna wrote:

Elli, could you provide an example with some data and your query?


Apologies Elli, I now have found your example. ;-)

Paolo



Re: Fueski with Larq - query anomaly

2012-11-16 Thread Paolo Castagna

Hi Elli

On 23/10/12 16:47, Elli Schwarz wrote:



Hello,


I am using Fuseki with Larq (thanks to Osma's recent instructions - thanks 
Osma!)  where I recompiled Jena (after adding the Larq dependency) to Jena 
revision 1399877 (this past Friday morning's version of the trunk). I'm 
noticing the following anomaly when querying the data:

First I insert the following triples:
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
insert data {  graph <urn:test:foo> {
  <urn:test:s1> <urn:test:p1> "foo"^^xsd:string .
  <urn:test:s1> <urn:test:p2> "foo"^^xsd:string .
  <urn:test:s2> <urn:test:p3> "foo"^^xsd:string .
} }

Then I stop Fuseki, delete my index directory, and restart Fuseki. (As an 
aside, I'd be very interested in a fix for this so I don't have to restart 
Fuseki to rebuild the index - I'm watching JENA-164 and hoping someone will be 
able to work on it soon!)


Re: JENA-164 ... yeah, I'd love to help you out, but it's a sort of 
architectural issue of Jena IMHO. It should be easier for developers to 
listen to events as triples are added/removed so that you can attach 
external indexes and keep them in sync.


There are multiple paths which you can use to change RDF data: APIs,
SPARQL, etc. From a user's point of view, you would like to keep your
external index always in sync, no matter where the updates come from.
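
The event-listening idea can be sketched as follows. This is a toy model in plain Python and makes no claim about Jena's actual listener interfaces: ToyGraph, TextIndexListener and their methods are hypothetical. The point it illustrates is that every update path goes through the same add and remove calls, so an attached index sees every change.

```python
class TextIndexListener:
    """Keeps a toy text index in sync by counting the triples that use each literal."""

    def __init__(self):
        self.counts = {}

    def added(self, triple):
        literal = triple[2]
        self.counts[literal] = self.counts.get(literal, 0) + 1

    def removed(self, triple):
        literal = triple[2]
        self.counts[literal] -= 1
        if self.counts[literal] == 0:
            # The last triple using this literal is gone, so drop it from the index.
            del self.counts[literal]

    def search(self, term):
        return [lit for lit in self.counts if lit == term]


class ToyGraph:
    """Toy triple store that notifies listeners about every real change."""

    def __init__(self):
        self.triples = set()
        self.listeners = []

    def register(self, listener):
        self.listeners.append(listener)

    def add(self, triple):
        if triple not in self.triples:
            self.triples.add(triple)
            for listener in self.listeners:
                listener.added(triple)

    def remove(self, triple):
        if triple in self.triples:
            self.triples.remove(triple)
            for listener in self.listeners:
                listener.removed(triple)


graph = ToyGraph()
index = TextIndexListener()
graph.register(index)
t1 = ("urn:test:s1", "urn:test:p1", "foo")
t2 = ("urn:test:s1", "urn:test:p2", "foo")
graph.add(t1)
graph.add(t2)
graph.remove(t1)
print(index.search("foo"))
graph.remove(t2)
print(index.search("foo"))
```

Counting references per literal also keeps a single index entry per literal, so no rebuild or restart is needed in this model when data changes.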


 Once Fuseki is back up, I run the following query (I have default 
graph set as the union of named graphs by default):

PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
select * where {
  <urn:test:s1> ?p ?lit .
  ?lit pf:textMatch "foo" .
}

and I get 2 results as I expect:


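--------------------------------------------------------------------
| p             | lit                                              |
====================================================================
| <urn:test:p1> | "foo"^^<http://www.w3.org/2001/XMLSchema#string> |
| <urn:test:p2> | "foo"^^<http://www.w3.org/2001/XMLSchema#string> |
--------------------------------------------------------------------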

However, when I flip the order of my query like this:

PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
select * where {
  ?lit pf:textMatch "foo" .
  <urn:test:s1> ?p ?lit .
}

I get 6 results, instead of the two I expect:

--------------------------------------------------------------------
| lit                                              | p             |
====================================================================
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |
--------------------------------------------------------------------

My guess as to what happens is that in the second query, first the query
executor executes the first line (the ?lit pf:textMatch "foo") and this
returns 3 results for "foo", since there are 3 literals for "foo". Then, the
next line of the query has three bindings to ?lit, so it produces the 6
results above (2 for each "foo" literal since there are 2 properties for
urn:test:s1). I know that I can avoid this by using a SELECT DISTINCT, but I
still think the query shouldn't produce different results based on switching
the order. Additionally, if I put this in a CONSTRUCT query, I can't use
DISTINCT to eliminate the duplicate results (unless I use a SELECT DISTINCT
subquery which I'd rather avoid).


I am not sure; at the moment I have no clear idea how this problem
could be fixed.


Paolo



Another point I've noticed is that in my other (much more complex) queries, 
against a much larger dataset (~1.5 million triples), if I put the pf:textMatch 
line anywhere but in the very beginning of the query, the query takes a VERY 
long time to execute. If I put it as the first line in the query, the query 
runs quickly. My guess for this is that the query is executed in order, and it 
takes much more work for the query executor to run the other parts of my query 
which contain many results, and then have to go back and essentially filter out 
those results where the literal doesn't match the pf:textMatch. I can always 
place the pf:textMatch line first, but then I'm back to the problem mentioned 
above where I get back too many duplicate results.

Thank you very much for your help!
-Elli





Re: Fueski with Larq - query anomaly

2012-11-02 Thread Osma Suominen

Hi Elli!

[apparently your reply didn't come through the mailing list, but this 
one should]


31.10.2012 23:11, Elli Schwarz kirjoitti:

Thank you for the tip. Yes, if I generate the index using the
larqbuilder command, I don't get the duplicates in the query, regardless
of the placement of the pf:textMatch line. (As an aside, why does the
default behavior of creating the index allow duplicates, but the default
of the larqbuilder command does not?)


Good to hear that eliminating duplicates works for you. I have no idea 
why the defaults are as they are.



However, switching the order of where I place the pf:textMatch line
(while it may slow down the query), should not produce different
results, even if there are duplicates in the index. This would appear to
be a bug in how Larq applies the results of the index lookup to the query.


I'm not sure whether getting or not getting duplicates in specific 
situations can be considered a bug. But yes, the implementation of LARQ 
seems to be rather simplistic. It might help if the raw index results 
were filtered to weed out duplicates before applying them to the query. 
Then the choice whether to try to avoid duplicates during indexing would 
only be an optimization issue.


BTW I'm not (so far) a LARQ developer, just a fellow user..

-Osma




Re: Fueski with Larq - query anomaly

2012-10-24 Thread Osma Suominen

Hi Elli!

It seems that at least part of your problem is having duplicates in the 
LARQ index. Have you tried creating the Lucene index using the larqbuilder 
command line tool, instead of removing the index and just letting Fuseki 
rebuild it when it starts? See the end of my tutorial [1] for a recipe.


As I understand it, unless you give larqbuilder the --allow-duplicates 
option, it will try to avoid duplicates in the index. Though the index 
building will take longer.


I've also noticed that it usually makes sense to place the pf:textMatch 
pattern first in the query, otherwise it will be executed many times and 
slow down the whole query, sometimes by a lot.


Hope this helps,
-Osma

[1] http://code.google.com/p/onki-light/wiki/InstallFusekiLARQ




--
Osma Suominen | osma.suomi...@aalto.fi | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing Research 
Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076 Aalto, 
Finland