Thank you, Andy, for the reply.

1. Performance: I was able to solve it by ordering the triples correctly. I 
read the chapter on optimization in the 'Learning SPARQL' book. The problem in 
my query was that I started with a large set - for example: give me all things 
A, then their Bs, and filter on B. The better option is: give me all things B 
that match the filter, then their As. After this tuning, all queries now return 
in under 1 second, which is great.
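To illustrate the reordering (with hypothetical predicates ex:hasB and ex:name, not my real schema), the change looks roughly like this:

```sparql
# Before: start with the large set of As, filter late
SELECT ?a WHERE {
  ?a a ex:A ;                            # every A in the store is scanned
     ex:hasB ?b .
  ?b ex:name ?name .
  FILTER (regex(?name, "^word", "i"))
}

# After: start with the selective pattern on B, then join to A
SELECT ?a WHERE {
  ?b ex:name ?name .
  FILTER (regex(?name, "^word", "i"))    # only a small set of Bs survives
  ?a ex:hasB ?b .                        # only the As of those Bs are touched
}
```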

2. I am trying to understand your feedback on the Lucene index. Apologies for 
not giving the actual code earlier; here is a better representation.

        <#entMap> a text:EntityMap ; 
                 text:entityField "uri" ; 
                 text:defaultField "text" ; 
                 text:map ( 
                         [ text:field "text" ; text:predicate no:name ] 
                         [ text:field "text" ; text:predicate no:address ] 
                         [ text:field "text" ; text:predicate no:bio ] 
                         [ text:field "text" ; text:predicate no:qualification ] 
                         [ text:field "text" ; text:predicate no:hobbies ] 
                        ) .

        I want the ability to do a free-text search over all of the properties 
name, address, bio, qualification, and hobbies in a single query. Given that, is 
there anything wrong with my configuration?
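If mapping all five predicates into the one default field "text" is valid here, my understanding is that a single lookup would then search across all of them at once - something like this sketch ("smith" is just a placeholder term):

```sparql
PREFIX text: <http://jena.apache.org/text#>

SELECT ?s
WHERE {
  # With text:defaultField "text", a bare query string searches that field,
  # which in my config holds name, address, bio, qualification and hobbies.
  ?s text:query "smith" .
}
```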


-Ajay

> On Nov 11, 2015, at 4:54 PM, Andy Seaborne <[email protected]> wrote:
> 
> On 11/11/15 04:40, Kamble, Ajay, Crest wrote:
>> Thank you Andy for replying.
>> 
>> 1. I have a mix of constrained and free text queries. My constrained queries 
>> (or without free text/normal sparql queries) took 3-10 seconds. Free text 
>> queries took around 1 second.
>>     Do you mean that volume of Lucene index will affect constrained queries 
>> as well?
>>     At this point I had just included few concepts for Lucene index. Here is 
>> my configuration:
>> 
>> <#entMap> a text:EntityMap ;
>>   text:entityField "uri" ;
>> text:defaultField "text" ;
>> text:map ( [ text:field "text" ; text:predicate no:concept1 ]
> 
> concept1 is a class later on, not a property.
> 
> If this is an anonymized setup+query, it's not helping in answering the 
> question.
> 
>>  [ text:field "text" ; text:predicate no:concept2 ]
>>  [ text:field "text" ; text:predicate no:concept3 ]
>>  [ text:field "text" ; text:predicate no:concept4 ]
>>  [ text:field "text" ; text:predicate no:concept5 ]
>>  [ text:field "text" ; text:predicate no:concept6 ] ) .
> 
> That uses the same Lucene field for each predicate - I'm not sure what will 
> happen.  At best, it puts all the index text in one field so Lucene has to 
> process all of it for any lookup.
> 
>> 
>> 2. Here is a sample query which takes 10+ seconds to execute. Is there 
>> anything wrong with this query (or possibility of optimization)?
> 
> The Lucene index and regex are unconnected.
> The Lucene index is accessed with a property function "text:query"
> http://jena.apache.org/documentation/query/text-query.html
> 
>> PREFIX ex:<http://example.com/ns/concepts#>
>> PREFIX d:<http://example.com/ns/data#>
>> 
>> SELECT DISTINCT ?a1
> 
> DISTINCT can hide a lot of work being done to find many, but few unique, 
> results.
> 
>> WHERE {
>>  ?n1 a ex:concept1 ;
>>  ex:concept2 ?c1 ;
> 
> concept as type and concept as property - looks odd to me.
> 
>>  ex:concept3 ?n2 ;
>>  ex:concept4 ?f1 ;
>>  ex:concept5 ?a1 .
>>  ?c1 ex:concept6 ?cn1 .
>>  ?f1 ex:concept7 ?fn1 .
> 
> Depending on the overall shape of your data, this is huge.  It does not start 
> anywhere so it might well be a scan of a lot of the database.
> 
> What's more, multiple occurrences of properties on the same subject will lead 
> to fan-out, causing duplication of ?a1, which is then hidden by the DISTINCT.
> 
>>  FILTER (regex(?n2, "^word1", "i"))
>>  FILTER (regex(?cn1, "^word2$", "i"))
>>  FILTER (regex(?fn1, "^word3$", "i")) }
> 
> The way this query will execute is that the pattern part is executed, 
> probably generating a lot of matches with a lot of duplication of ?a1, and 
> the filters are then used to test the results.  Filters are pushed to the 
> best place but there is only so much they can do.
> 
> Better might be:
> (after sorting out the reuse of one field in the lucene index)
> 
>  # Look for all ?n2 of interest by concept2 in Lucene:
>  ?n2 text:query (ex:concept2 "word1") .
> 
>  # Then do pattern matching only for those ?n2
>  ?n1 ex:concept3 ?n2 ;
>      ex:concept2 ?c1 ;
>      ex:concept4 ?f1 ;
>      ex:concept5 ?a1 .
>  ?c1 ex:concept6 ?cn1 .
>  ?f1 ex:concept7 ?fn1 .
>  # Checks
>  FILTER (regex(?cn1, "^word2$", "i"))
>  FILTER (regex(?fn1, "^word3$", "i")) }
> 
> You can start at word2 or word3 similarly - use the one with the fewest 
> likely matches.
> 
> You may need to keep the FILTERs if the way you get Lucene matches is more 
> general than the regex version (e.g. stemming matters).
> 
>       Andy
> 
>> 
>> 3. About Hardware, right now I am just running this on my MacBook Pro with 
>> 2.5 GHz Intel Core i7 and 16 GB of RAM.
>> 
>> It would be great if you could give me some suggestions or point me to any 
>> resource that explains Fuseki optimization.
> 
