Re: How to do text search with Jena and Fuseki

Kamble, Ajay, Crest Tue, 10 Nov 2015 06:52:18 -0800

Hello,

1. Setup for Free Text Search


        In the assembler file I had to add two service entries, one for the TDB
dataset and one for the Lucene-indexed text dataset. After this change I was
able to run free text queries against my TDB dataset. However, I am not sure
whether this is the correct way.

        <#service> rdf:type fuseki:Service ;
         fuseki:name "mydb" ; # http://host:port/mydb
         fuseki:serviceQuery "query" ; # SPARQL query service
         fuseki:serviceQuery "sparql" ; # SPARQL query service
         fuseki:serviceUpdate "update" ; # SPARQL update service
         fuseki:serviceUpload "upload" ; # Non-SPARQL upload service
         fuseki:serviceReadWriteGraphStore "data" ; # SPARQL Graph Store protocol (read and write)
         fuseki:dataset <#dataset> ;
         #fuseki:dataset :text_dataset ;
        .

         <#service_text_tdb> rdf:type fuseki:Service ;
         fuseki:name "fts" ; # http://host:port/fts
         fuseki:serviceQuery "query" ; # SPARQL query service
         fuseki:serviceQuery "sparql" ; # SPARQL query service
         fuseki:serviceUpdate "update" ; # SPARQL update service
         fuseki:serviceUpload "upload" ; # Non-SPARQL upload service
         fuseki:serviceReadWriteGraphStore "data" ; # SPARQL Graph Store protocol (read and write)
         #fuseki:dataset <#dataset> ;
         fuseki:dataset :text_dataset ;
        .
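
A possible simplification, offered as a sketch rather than a confirmed fix: as
noted further down the thread, a query without text:query executes like a
plain SPARQL query on the TDB dataset, so a single service attached to
:text_dataset may be enough for both kinds of query. The fragment below
assumes that <#dataset>, <#indexLucene>, <#entMap> and :text_dataset are
defined as in the assembler file quoted later in this message; the service
name "mydb" is reused from above.

```turtle
# Sketch only, not a tested configuration.
@prefix :       <http://mydb.com/ns/dataset#> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix fuseki: <http://jena.apache.org/fuseki#> .

# Assumption: <#dataset>, <#indexLucene>, <#entMap> and :text_dataset are
# defined as in the assembler file quoted later in this thread.

[] rdf:type fuseki:Server ;
   fuseki:services ( <#service_text_tdb> ) .

# One service over the text dataset, which wraps the TDB dataset.
<#service_text_tdb> rdf:type fuseki:Service ;
    fuseki:name                       "mydb" ;    # http://host:port/mydb
    fuseki:serviceQuery               "query" ;   # SPARQL query service
    fuseki:serviceQuery               "sparql" ;  # SPARQL query service
    fuseki:serviceUpdate              "update" ;  # SPARQL update service
    fuseki:serviceUpload              "upload" ;  # Non-SPARQL upload service
    fuseki:serviceReadWriteGraphStore "data" ;    # SPARQL Graph Store protocol (read and write)
    fuseki:dataset                    :text_dataset ;
    .
```

With this layout, updates sent through the service go through the text
dataset, which should let the Lucene index follow new data instead of needing
a separate textindexer run after every load.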

2. Performance

        I am trying to evaluate Jena+Fuseki for a project. I loaded 3161033
triples into Fuseki. Our queries are search-type queries: for example, given a
search term or phrase, get the count of results, the first 20 results and some
facets. All queries took between 3 and 10 seconds to execute, which was
disappointing.

To be fair, I do not have much knowledge yet and have only done a basic setup
at this point.
        Are there ways to get better performance?
        Is the data size a problem here? The number of triples is only going to
increase.
        Can it give performance better than or comparable to Neo4j for the same
data?

Interestingly, free text search returned much sooner than the other queries; it
took roughly 1 second.

3. Other Triplestore

        What other triplestores can be used if high performance is required
along with the ability to do free text search?

-Ajay

> On Nov 4, 2015, at 10:10 PM, Andy Seaborne <[email protected]> wrote:
> 
> On 04/11/15 16:11, Kamble, Ajay, Crest wrote:
>> I created text index with this command:
>> 
>> java -cp fuseki-server.jar jena.textindexer --desc=/tmp/fuseki-assembler.ttl
> 
> This must be done after you removed tdb:unionDefaultGraph
> 
> Then check the place where you have stored the text index (and check there 
> are not two on your disk - you gave it a relative file name) and see if it 
> has any data in it.
> 
> 
>       Andy
> 
>> 
>> -Regards
>> Ajay
>> 
>> On Nov 4, 2015, at 9:28 PM, Kamble, Ajay, Crest <[email protected]> wrote:
>> 
>> Hi Andy,
>> 
>> Thanks for help. My server was able to access data after commenting 
>> ‘tdb:unionDefaultGraph’.
>> 
>> But the free text search that I tried did not work. I tried the following 
>> query but got 0 results.
>> 
>> PREFIX text: <http://jena.apache.org/text#>
>> 
>> SELECT ?s
>> {
>>    ?s text:query 'gold' .
>> }
>> 
>> Is my configuration for text search correct? Also, how do I specify 2 
>> datasets in a single service?
>> 
>> Here is snippet from configuration:
>> 
>> # Text index description
>> <#indexLucene> a text:TextIndexLucene ;
>>    text:directory <file:Lucene> ;
>>    ##text:directory "mem" ;
>>    text:entityMap <#entMap> ;
>>    .
>> 
>> # Mapping in the index
>> # URI stored in field "uri"
>> # rdfs:label is mapped to field "text"
>> <#entMap> a text:EntityMap ;
>>    text:entityField      "uri" ;
>>    text:defaultField     "text" ;
>>    text:map (
>>         [ text:field "text" ; text:predicate no:name ]
>>         [ text:field "text" ; text:predicate no:alt-name ]
>>         [ text:field "text" ; text:predicate no:name ]
>>         [ text:field "text" ; text:predicate no:title ]
>>         [ text:field "text" ; text:predicate no:author ]
>>         [ text:field "text" ; text:predicate no:inventor ]
>>         ) .
>> 
>> [] rdf:type fuseki:Server ;
>>   # Server-wide context parameters can be given here.
>>   # For example, to set query timeouts: on a server-wide basis:
>>   # Format 1: "1000" -- 1 second timeout
>>   # Format 2: "10000,60000" -- 10s timeout to first result, then 60s timeout for the rest of the query.
>>   # See java doc for ARQ.queryTimeout
>>   # ja:context [ ja:cxtName "arq:queryTimeout" ;  ja:cxtValue "10000" ] ;
>> 
>>   # Load custom code (rarely needed)
>>   # ja:loadClass "your.code.Class" ;
>> 
>>   # Services available.  Only explicitly listed services are configured.
>>   #  If there is a service description not linked from this list, it is ignored.
>>   fuseki:services (
>>     <#service>
>>     #<#service_text_tdb>
>>   ) .
>> 
>> <#service>  rdf:type fuseki:Service ;
>>    fuseki:name                       "mydb" ;     # http://host:port/mydb
>>    fuseki:serviceQuery               "query" ;    # SPARQL query service
>>    fuseki:serviceQuery               "sparql" ;   # SPARQL query service
>>    fuseki:serviceUpdate              "update" ;   # SPARQL update service
>>    fuseki:serviceUpload              "upload" ;   # Non-SPARQL upload service
>>    fuseki:serviceReadWriteGraphStore "data" ;     # SPARQL Graph Store protocol (read and write)
>>    fuseki:dataset           <#dataset> ;
>>    #fuseki:dataset                  :text_dataset ;
>> .
>> 
>> -Regards
>> Ajay
>> 
>> On Nov 4, 2015, at 7:54 PM, Andy Seaborne <[email protected]> wrote:
>> 
>> On 04/11/15 14:11, Kamble, Ajay, Crest wrote:
>> That worked for me. Also the option is —config and not —conf.
>> 
>> --config and --conf are synonyms.
>> 
>> And it's "-" or "--" but not the en-dash or em-dash character your email is 
>> putting in.
>> 
>> 
>> Fuseki starts but it does not read my existing data. If I execute simple 
>> query to get count of triples, I get 0. Also, Fuseki gives this warning - 
>> Dataset not found: No session.
>> 
>> Check the config file.
>> 
>> Try without "tdb:unionDefaultGraph true"
>> 
>> 
>> If I start Fuseki with —loc option and not —config, then it correctly reads 
>> all data and the same query gives correct count.
>> 
>> --loc is shorthand for TDB only, no text dataset, no default union graph.
>> 
>> 
>> Is there anything wrong with the way I have configured dataset in assembler 
>> file?
>> 
>> Also, do I need to create 2 different services for normal SPARQL queries and 
>> text search?
>> 
>> If the query has no text:query, it executes like a plain SPARQL query on the 
>> TDB datasets.
>> 
>> In other words, can I execute both types of queries in single console or not?
>> 
>> -Regards
>> Ajay
>> 
>> 
>> On Nov 4, 2015, at 7:35 PM, Andy Seaborne <[email protected]> wrote:
>> 
>> On 04/11/15 13:59, Kamble, Ajay, Crest wrote:
>> Hi Andy,
>> 
>> I tried that but it did not work. I got another error,
>> 
>> fuseki-server --update —conf=/tmp/fuseki-assembler.ttl /mydb
>> Required: either --config=FILE or one of --mem, --file, --loc or --desc
>> 
>> fuseki-server --conf=/tmp/fuseki-assembler.ttl
>> 
>> The service name is in the assembler file - you can't give it again on the 
>> command line.
>> 
>> Andy
>> 
>> 
>> -Regards
>> Ajay
>> 
>> On Nov 4, 2015, at 5:43 PM, Andy Seaborne <[email protected]> wrote:
>> 
>> Change "--desc" to "--conf"
>> 
>> "--desc" works in the restricted case when there is one dataset description 
>> - but in this case there are two - the TDB dataset and the text dataset 
>> built over that.
>> 
>> Andy
>> 
>> On 04/11/15 12:10, Kamble, Ajay, Crest wrote:
>> Hi All,
>> 
>> 1. Triplestore
>> 
>> I have an existing triplestore that I set up by loading data into Fuseki. I 
>> used Java code to put all the triples into Fuseki (the URL I used was 
>> http://localhost:3030/mydb/data). Before loading the data I started 
>> Fuseki with this command:
>> 
>> fuseki-server --update --loc=/tmp/fuseki-tdb /mydb
>> (on Mac OS X).
>> 
>> My database is located at /tmp/fuseki-tdb
>> 
>> This setup works well and I can query all triples from console.
>> 
>> 2. Free Text Search
>> 
>> I need to set up free text search on top of this triplestore, so that normal 
>> SPARQL queries and free text queries are both possible.
>> 
>> Here is the assembler file that I used.
>> 
>> @prefix :        <http://mydb.com/ns/dataset#> .
>> @prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>> @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
>> @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
>> @prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
>> @prefix text:    <http://jena.apache.org/text#> .
>> @prefix fuseki:  <http://jena.apache.org/fuseki#> .
>> @prefix no: <http://mydb.com/ns/concepts#> .
>> @prefix d: <http://mydb.com/ns/data#> .
>> 
>> ## Example of a TDB dataset and text index
>> ## Initialize TDB
>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
>> tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
>> tdb:GraphTDB    rdfs:subClassOf  ja:Model .
>> 
>> ## Initialize text query
>> [] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
>> # A TextDataset is a regular dataset with a text index.
>> text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
>> # Lucene index
>> text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
>> # Solr index
>> text:TextIndexSolr    rdfs:subClassOf   text:TextIndex .
>> 
>> ## ---------------------------------------------------------------
>> ## This URI must be fixed - it's used to assemble the text dataset.
>> 
>> :text_dataset rdf:type     text:TextDataset ;
>>    text:dataset   <#dataset> ;
>>    text:index     <#indexLucene> ;
>>    .
>> 
>> # A TDB dataset used for RDF storage
>> <#dataset> rdf:type      tdb:DatasetTDB ;
>>    tdb:location "/tmp/fuseki-tdb" ;
>>    tdb:unionDefaultGraph true ; # Optional
>>    .
>> 
>> # Text index description
>> <#indexLucene> a text:TextIndexLucene ;
>>    text:directory <file:Lucene> ;
>>    ##text:directory "mem" ;
>>    text:entityMap <#entMap> ;
>>    .
>> 
>> # Mapping in the index
>> # URI stored in field "uri"
>> # rdfs:label is mapped to field "text"
>> <#entMap> a text:EntityMap ;
>>    text:entityField      "uri" ;
>>    text:defaultField     "text" ;
>>    text:map (
>>         [ text:field "text" ; text:predicate no:name ]
>>         [ text:field "text" ; text:predicate no:alt-name ]
>>         [ text:field "text" ; text:predicate no:name ]
>>         [ text:field "text" ; text:predicate no:title ]
>>         [ text:field "text" ; text:predicate no:author ]
>>         [ text:field "text" ; text:predicate no:inventor ]
>>         ) .
>> 
>> [] rdf:type fuseki:Server ;
>>   # Server-wide context parameters can be given here.
>>   # For example, to set query timeouts: on a server-wide basis:
>>   # Format 1: "1000" -- 1 second timeout
>>   # Format 2: "10000,60000" -- 10s timeout to first result, then 60s timeout for the rest of the query.
>>   # See java doc for ARQ.queryTimeout
>>   # ja:context [ ja:cxtName "arq:queryTimeout" ;  ja:cxtValue "10000" ] ;
>> 
>>   # Load custom code (rarely needed)
>>   # ja:loadClass "your.code.Class" ;
>> 
>>   # Services available.  Only explicitly listed services are configured.
>>   #  If there is a service description not linked from this list, it is ignored.
>>   fuseki:services (
>>     <#service>
>>     #<#service_text_tdb>
>>   ) .
>> 
>> <#service>  rdf:type fuseki:Service ;
>>    fuseki:name                       "mydb" ;     # http://host:port/mydb
>>    fuseki:serviceQuery               "query" ;    # SPARQL query service
>>    fuseki:serviceQuery               "sparql" ;   # SPARQL query service
>>    fuseki:serviceUpdate              "update" ;   # SPARQL update service
>>    fuseki:serviceUpload              "upload" ;   # Non-SPARQL upload service
>>    fuseki:serviceReadWriteGraphStore "data" ;     # SPARQL Graph Store protocol (read and write)
>>    fuseki:dataset           <#dataset> ;
>>    fuseki:dataset                  :text_dataset ;
>> .
>> 
>> With this assembler file, I start my server with the following command:
>> 
>> fuseki-server --update --desc=/Users/kamb16/projects/nano/data/fuseki-assembler.ttl /mydb
>> 
>> I get the following error:
>> 
>> com.hp.hpl.jena.sparql.ARQException: Found two matches: var ?root -> http://mydb.com/ns/dataset#text_dataset, file:///tmp/fuseki-assembler.ttl#dataset
>> at com.hp.hpl.jena.sparql.util.QueryExecUtils.getOne(QueryExecUtils.java:360)
>> at com.hp.hpl.jena.sparql.util.graph.GraphUtils.findRootByType(GraphUtils.java:194)
>> at com.hp.hpl.jena.sparql.core.assembler.AssemblerUtils.build(AssemblerUtils.java:91)
>> at arq.cmdline.ModAssembler.create(ModAssembler.java:68)
>> at arq.cmdline.ModDatasetAssembler.createDataset(ModDatasetAssembler.java:43)
>> at org.apache.jena.fuseki.FusekiCmd.processModulesAndArgs(FusekiCmd.java:307)
>> at arq.cmdline.CmdArgModule.process(CmdArgModule.java:50)
>> at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
>> at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
>> at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
>> at org.apache.jena.fuseki.FusekiCmd.main(FusekiCmd.java:166)
>> 
>> I do not understand how to fix this issue. Could you please help? I want to 
>> do regular SPARQL queries as well as free text search.
>> 
>> Regards,
>> Ajay
>> 
>
