Hi Osma

On 20/08/12 11:10, Osma Suominen wrote:
> Hi Paolo!
>
> Thanks for your quick reply.
>
> 17.08.2012 20:16, Paolo Castagna wrote:
>> Does your problem go away without changing the code and using:
>> ?lit pf:textMatch ( 'a*' 100000 )
>
> I tested this but it didn't help. If I use a parameter less than 1000
> then I get even fewer hits, but values above 1000 don't have any effect.
Right.

> I think the problem is this line in IndexLARQ.java:
>
> TopDocs topDocs = searcher.search(query, (Filter)null, LARQ.NUM_RESULTS) ;
>
> As you can see, the parameter for the maximum number of hits is taken
> directly from the NUM_RESULTS constant. The value specified in the query
> has no effect at this level.

Correct.

>> It's not a problem adding a couple of '0'...
>> However, I am thinking that this would just shift the problem, wouldn't it?
>
> You're right, it would just shift the problem, but a sufficiently large
> value could be used that never caused problems in practice. Maybe you
> could consider NUM_RESULTS = Integer.MAX_VALUE ? :)

Many search use cases drive a UI for people, and often only the first few
results are necessary. Try continuing to hit 'next >>' on Google: how many
results can you actually get? ;-)

Anyway, I increased the NUM_RESULTS constant.

> Or maybe LARQ should use another variant of Lucene's
> IndexSearcher.search(), one which takes a Collector object instead of
> the integer n parameter. E.g. this:
> http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/core/org/apache/lucene/search/IndexSearcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20org.apache.lucene.search.Collector%29

Yes. That would be the thing to use if we want to retrieve all the results
from Lucene. More thinking is necessary here...

In the meantime, you can find a LARQ SNAPSHOT here:
https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-larq/1.0.1-SNAPSHOT/

Paolo

>
>
> Thanks,
> Osma
>
>
>> On 15/08/12 10:31, Osma Suominen wrote:
>>> Hi Paolo!
>>>
>>> Thanks for your reply and sorry for the delay.
>>>
>>> I tested this again with today's svn snapshot and it's still a problem.
>>>
>>> However, after digging a bit further I found this in
>>> jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:
>>>
>>> --clip--
>>>     // The number of results returned by default
>>>     public static final int NUM_RESULTS = 1000 ; // should we increase this? -- PC
>>> --clip--
>>>
>>> I changed NUM_RESULTS to 100000 (added two zeros), built and installed
>>> my modified LARQ with mvn install (NB this required tweaking arq.ver
>>> and tdb.ver in jena-larq/pom.xml to match the current svn versions),
>>> rebuilt Fuseki, and now the problem is gone!
>>>
>>> I would suggest that this constant be increased to something larger
>>> than 1000. Based on the code comment, you seem to have had similar
>>> thoughts sometime in the past :)
>>>
>>> Thanks,
>>> Osma
>>>
>>>
>>> 15.07.2012 11:21, Paolo Castagna wrote:
>>>> Hi Osma,
>>>> first of all, thanks for sharing your experience and clearly describing
>>>> your problem.
>>>> Further comments inline.
>>>>
>>>> On 13/07/12 14:13, Osma Suominen wrote:
>>>>> Hello!
>>>>>
>>>>> I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
>>>>> create a system for accessing SKOS thesauri. The user interface
>>>>> includes an autocompletion widget. The idea is to use the LARQ index
>>>>> to make fast prefix queries on the concept labels.
>>>>>
>>>>> However, I've noticed that in some situations I get fewer results from
>>>>> the index than I'd expect. This seems to happen when the LARQ part of
>>>>> the query internally produces many hits, such as when doing a
>>>>> single-character prefix query (e.g. ?lit pf:textMatch 'a*').
>>>>>
>>>>> I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and
>>>>> LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ
>>>>> dependency to pom.xml and running mvn package. Other than this issue,
>>>>> Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux
>>>>> 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard
>>>>> Ubuntu packages.
>>>>>
>>>>>
>>>>> Steps to reproduce:
>>>>>
>>>>> 1. Package Fuseki with LARQ, as described above.
>>>>>
>>>>> 2. Start Fuseki with the attached configuration file, i.e.
>>>>>    ./fuseki-server --config=larq-config.ttl
>>>>>
>>>>> 3. I'm using the STW thesaurus as an easily accessible example data
>>>>>    set (though the problem was originally found with other data sets):
>>>>>    - download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
>>>>>    - unzip so you have stw.rdf
>>>>>
>>>>> 4. Load the thesaurus file into the endpoint:
>>>>>    ./s-put http://localhost:3030/ds/data default stw.rdf
>>>>>
>>>>> 5. Build the LARQ index, e.g. this way:
>>>>>    - kill Fuseki
>>>>>    - rm -r /tmp/lucene
>>>>>    - start Fuseki again, so the index will be built
>>>>>
>>>>> 6. Make SPARQL queries from the web interface at http://localhost:3030
>>>>>
>>>>> First try this SPARQL query:
>>>>>
>>>>> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
>>>>> PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
>>>>> SELECT DISTINCT * WHERE {
>>>>>   ?lit pf:textMatch "ar*" .
>>>>>   ?conc skos:prefLabel ?lit .
>>>>>   FILTER(REGEX(?lit, '^ar.*', 'i'))
>>>>> } ORDER BY ?lit
>>>>>
>>>>> I get 120 hits, including "Arab"@en.
>>>>>
>>>>> Now try the same query, but change the pf:textMatch argument to "a*".
>>>>> This way I get only 32 results, not including "Arab"@en, even though
>>>>> the shorter prefix query should match a superset of what was matched
>>>>> by the first query (the regex should still filter it down to the same
>>>>> result set).
>>>>>
>>>>>
>>>>> This issue is not limited to single-character prefix queries. With
>>>>> enough data sets loaded into the same index, it happens with longer
>>>>> prefix queries as well.
>>>>>
>>>>> I think that the problem might be related to Lucene's default
>>>>> limitation of a maximum of 1024 clauses in boolean queries (and thus
>>>>> prefix query matches), as described in the Lucene FAQ:
>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
>>>>
>>>> Yes, I think your hypothesis might be correct (I've not verified it
>>>> yet).
>>>>
>>>>> In case this is the problem, is there any way to tell LARQ to use a
>>>>> higher BooleanQuery.setMaxClauseCount() value so that this limit is
>>>>> not triggered? I find it a bit disturbing that hits are silently being
>>>>> lost. I couldn't see any special output in the Fuseki log.
>>>>
>>>> Not sure about this.
>>>>
>>>> Paolo
>>>>
>>>>>
>>>>> Am I doing something wrong? If this is a genuine problem in LARQ, I
>>>>> can of course make a bug report.
>>>>>
>>>>>
>>>>> Thanks and best regards,
>>>>> Osma Suominen
>>>>
>>>>
>>>
>>>
>>
>
>
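For reference, the Collector-based retrieval discussed above might look roughly like this. This is a minimal, self-contained sketch against the Lucene 3.x API (the thread links the 3.1.0 javadoc), not LARQ's actual code: the `label` field, the example documents, and the `CollectAllExample` class name are made up for illustration. It also shows the `BooleanQuery.setMaxClauseCount()` call mentioned as a workaround for the 1024-clause limit.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class CollectAllExample {
    public static void main(String[] args) throws Exception {
        // Build a tiny in-memory index (illustrative data only).
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_36,
                        new StandardAnalyzer(Version.LUCENE_36)));
        for (String label : new String[] { "arab", "argentina", "banana" }) {
            Document doc = new Document();
            doc.add(new Field("label", label, Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();

        // Raise the clause limit so a large prefix expansion does not
        // throw TooManyClauses (the 1024-clause default from the FAQ).
        BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);

        IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));

        // Collect *every* matching document id, with no top-n cut-off,
        // instead of searcher.search(query, (Filter)null, NUM_RESULTS).
        final List<Integer> hits = new ArrayList<Integer>();
        searcher.search(new PrefixQuery(new Term("label", "a")), new Collector() {
            private int docBase;
            public void setScorer(Scorer scorer) { /* scores not needed */ }
            public void collect(int doc) { hits.add(docBase + doc); }
            public void setNextReader(IndexReader reader, int docBase) {
                this.docBase = docBase;
            }
            public boolean acceptsDocsOutOfOrder() { return true; }
        });

        // Prints the total number of "a*" matches, regardless of any limit.
        System.out.println(hits.size());
        searcher.close();
    }
}
```

The trade-off, as hinted at in the thread, is that an unbounded Collector keeps every hit in memory, which is why a top-n cut-off is the sensible default for UI-driven search and why "more thinking is necessary" before making it LARQ's behaviour.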