Re: How to do text search with Jena and Fuseki

Kamble, Ajay, Crest Tue, 10 Nov 2015 20:41:12 -0800

Thank you Andy for replying.

1. I have a mix of constrained and free text queries. My constrained queries 
(that is, normal SPARQL queries without free text) took 3-10 seconds. Free text 
queries took around 1 second.
    Do you mean that the size of the Lucene index will affect constrained 
queries as well?
    At this point I have included only a few concepts in the Lucene index. Here 
is my configuration:


<#entMap> a text:EntityMap ;
   text:entityField      "uri" ;
   text:defaultField     "text" ;
   text:map (
        [ text:field "text" ; text:predicate no:concept1 ]
        [ text:field "text" ; text:predicate no:concept2 ]
        [ text:field "text" ; text:predicate no:concept3 ]
        [ text:field "text" ; text:predicate no:concept4 ]
        [ text:field "text" ; text:predicate no:concept5 ]
        [ text:field "text" ; text:predicate no:concept6 ]
        ) .

2. Here is a sample query which takes 10+ seconds to execute. Is there anything 
wrong with this query, or any way to optimize it?

PREFIX ex: <http://example.com/ns/concepts#>
PREFIX d: <http://example.com/ns/data#>

SELECT DISTINCT ?a1
WHERE {
  ?n1 a ex:concept1 ;
      ex:concept2 ?c1 ;
      ex:concept3 ?n2 ;
      ex:concept4 ?f1 ;
      ex:concept5 ?a1 .
  ?c1 ex:concept6 ?cn1 .
  ?f1 ex:concept7 ?fn1 .
  FILTER (regex(?n2, "^word1", "i"))
  FILTER (regex(?cn1, "^word2$", "i"))
  FILTER (regex(?fn1, "^word3$", "i"))
}

3. About hardware: right now I am just running this on my MacBook Pro with a 
2.5 GHz Intel Core i7 and 16 GB of RAM.

It would be great if you could give me some suggestions or point me to any 
resource that explains Fuseki optimization.

-Ajay

On Nov 11, 2015, at 4:27 AM, Andy Seaborne <[email protected]> wrote:

I was trying to evaluate Jena+Fuseki for a project. The number of
triples that I put in Fuseki is 3161033. Our queries are of search
type, for example, given a search term/phrase get count of results,
first 20 results and some facets. All queries took between 3-10
seconds to execute, which was disappointing.

3 million triples.  That's not very many.  It will depend on how much is 
indexed into Lucene and what the query actually is but elsewhere I've seen much 
larger datasets with text query running much faster.

There are lots of possible system factors such as hardware, server or client 
restarts (this is Java!) and how you send the query to the server.

Andy


On 10/11/15 14:51, Kamble, Ajay, Crest wrote:
Hello,

1. Setup for Free Text Search

In the assembler file I had to add two service entries, one for the TDB dataset 
and one for the Lucene-indexed dataset. After this change I was able to run 
free text queries against my TDB dataset. However, I am not sure if this is the 
correct way.

<#service> rdf:type fuseki:Service ;
 fuseki:name "mydb" ; # http://host:port/mydb
 fuseki:serviceQuery "query" ; # SPARQL query service
 fuseki:serviceQuery "sparql" ; # SPARQL query service
 fuseki:serviceUpdate "update" ; # SPARQL update service
 fuseki:serviceUpload "upload" ; # Non-SPARQL upload service
 fuseki:serviceReadWriteGraphStore "data" ; # SPARQL Graph store protocol (read and write)
 fuseki:dataset <#dataset> ;
 #fuseki:dataset :text_dataset ;
.

<#service_text_tdb> rdf:type fuseki:Service ;
 fuseki:name "fts" ; # http://host:port/fts
 fuseki:serviceQuery "query" ; # SPARQL query service
 fuseki:serviceQuery "sparql" ; # SPARQL query service
 fuseki:serviceUpdate "update" ; # SPARQL update service
 fuseki:serviceUpload "upload" ; # Non-SPARQL upload service
 fuseki:serviceReadWriteGraphStore "data" ; # SPARQL Graph store protocol (read and write)
 #fuseki:dataset <#dataset> ;
 fuseki:dataset :text_dataset ;
.

2. Performance

I was trying to evaluate Jena+Fuseki for a project. The number of triples that 
I put in Fuseki is 3161033. Our queries are of search type, for example, given 
a search term/phrase get count of results, first 20 results and some facets. 
All queries took between 3-10 seconds to execute, which was disappointing.

To be fair, I do not have much knowledge and I have just done basic setup at 
this point.
Are there any ways to get a better performance?
Is the data size a problem here? The count of triples is only going to increase.
Can it give better or comparable performance than Neo4J for same data?

Interestingly, free text search returned much earlier than other queries, it 
took roughly 1 second.

3. Other Triplestore

What other triplestore can be used if high performance is required along with 
ability to do free text search?

-Ajay

On Nov 4, 2015, at 10:10 PM, Andy Seaborne <[email protected]> wrote:

On 04/11/15 16:11, Kamble, Ajay, Crest wrote:
I created text index with this command:

java -cp fuseki-server.jar jena.textindexer --desc=/tmp/fuseki-assembler.ttl

This must be done after you removed tdb:unionDefaultGraph

Then check the place where you have stored the text index (and check there are 
not two on your disk - you gave it a relative file name) and see if it has any 
data in it.
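
One way to rule out the two-index problem is to give the index an absolute location, so that jena.textindexer and fuseki-server cannot end up with different directories depending on where each one is started. A minimal sketch, assuming the index should live under /tmp/fuseki-lucene (a placeholder path chosen for illustration, not one used in this thread):

```turtle
@prefix text: <http://jena.apache.org/text#> .

# Text index description with an absolute directory.
# /tmp/fuseki-lucene is a hypothetical path; substitute your own.
<#indexLucene> a text:TextIndexLucene ;
   text:directory <file:///tmp/fuseki-lucene> ;
   text:entityMap <#entMap> ;
   .
```

Here <#entMap> is the entity map already defined in the assembler file. After changing the directory, the index would need to be rebuilt with the textindexer command above.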


Andy


-Regards
Ajay

On Nov 4, 2015, at 9:28 PM, Kamble, Ajay, Crest <[email protected]> wrote:

Hi Andy,

Thanks for help. My server was able to access data after commenting 
‘tdb:unionDefaultGraph’.

But the free text search that I tried did not work. I tried following query but 
I got 0 results.

PREFIX text: <http://jena.apache.org/text#>

SELECT ?s
{
   ?s text:query 'gold' .
}

Is my configuration for text search correct? Also, how do I specify 2 datasets 
in a single service?

Here is snippet from configuration:

# Text index description
<#indexLucene> a text:TextIndexLucene ;
   text:directory <file:Lucene> ;
   ##text:directory "mem" ;
   text:entityMap <#entMap> ;
   .

# Mapping in the index
# URI stored in field "uri"
# rdfs:label is mapped to field "text"
<#entMap> a text:EntityMap ;
   text:entityField      "uri" ;
   text:defaultField     "text" ;
   text:map (
        [ text:field "text" ; text:predicate no:name ]
        [ text:field "text" ; text:predicate no:alt-name ]
        [ text:field "text" ; text:predicate no:name ]
        [ text:field "text" ; text:predicate no:title ]
        [ text:field "text" ; text:predicate no:author ]
        [ text:field "text" ; text:predicate no:inventor ]
        ) .

[] rdf:type fuseki:Server ;
  # Server-wide context parameters can be given here.
  # For example, to set query timeouts on a server-wide basis:
  # Format 1: "1000" -- 1 second timeout
  # Format 2: "10000,60000" -- 10s timeout to first result, then 60s timeout for the rest of the query.
  # See java doc for ARQ.queryTimeout
  # ja:context [ ja:cxtName "arq:queryTimeout" ;  ja:cxtValue "10000" ] ;

  # Load custom code (rarely needed)
  # ja:loadClass "your.code.Class" ;

  # Services available.  Only explicitly listed services are configured.
  #  If there is a service description not linked from this list, it is ignored.
  fuseki:services (
    <#service>
    #<#service_text_tdb>
  ) .

<#service>  rdf:type fuseki:Service ;
   fuseki:name                       "mydb" ;     # http://host:port/mydb
   fuseki:serviceQuery               "query" ;    # SPARQL query service
   fuseki:serviceQuery               "sparql" ;   # SPARQL query service
   fuseki:serviceUpdate              "update" ;   # SPARQL update service
   fuseki:serviceUpload              "upload" ;   # Non-SPARQL upload service
   fuseki:serviceReadWriteGraphStore "data" ;     # SPARQL Graph store protocol (read and write)
   fuseki:dataset                    <#dataset> ;
   #fuseki:dataset                  :text_dataset ;
.

-Regards
Ajay

On Nov 4, 2015, at 7:54 PM, Andy Seaborne <[email protected]> wrote:

On 04/11/15 14:11, Kamble, Ajay, Crest wrote:
That worked for me. Also the option is —config and not —conf.

--config and --conf are synonyms.

And it's "-" or "--" but not the en-dash or em-dash character your email is 
putting in.


Fuseki starts but it does not read my existing data. If I execute simple query 
to get count of triples, I get 0. Also, Fuseki gives this warning - Dataset not 
found: No session.

Check the config file.

Try without "tdb:unionDefaultGraph true"


If I start Fuseki with —loc option and not —config, then it correctly reads all 
data and the same query gives correct count.

--loc is shorthand for TDB only, no text dataset, no default union graph.


Is there anything wrong with the way I have configured dataset in assembler 
file?

Also, do I need to create 2 different services for normal SPARQL queries and 
text search?

If the query has no text:query, it executes like a plain SPARQL query on the 
TDB datasets.
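
Put differently, a single service should be enough as long as its dataset is the text dataset, since that one wraps the TDB dataset. A sketch only, reusing the names from the assembler file in this thread; :text_dataset and <#dataset> are assumed to be defined as they are there:

```turtle
@prefix :        <http://mydb.com/ns/dataset#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .

# One service for both plain SPARQL queries and text:query.
# :text_dataset points at <#dataset> through text:dataset,
# so <#dataset> is not listed here a second time.
<#service> rdf:type fuseki:Service ;
   fuseki:name                       "mydb" ;
   fuseki:serviceQuery               "query" ;
   fuseki:serviceQuery               "sparql" ;
   fuseki:serviceUpdate              "update" ;
   fuseki:serviceUpload              "upload" ;
   fuseki:serviceReadWriteGraphStore "data" ;
   fuseki:dataset                    :text_dataset ;
   .
```

With this shape, a query that contains text:query uses the Lucene index, and a query without it runs as ordinary SPARQL over the same data.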

In other words, can I execute both types of queries in single console or not?

-Regards
Ajay


On Nov 4, 2015, at 7:35 PM, Andy Seaborne <[email protected]> wrote:

On 04/11/15 13:59, Kamble, Ajay, Crest wrote:
Hi Andy,

I tried that but it did not work. I got another error,

fuseki-server --update —conf=/tmp/fuseki-assembler.ttl /mydb
Required: either --config=FILE or one of --mem, --file, --loc or --desc

fuseki-server --conf=/tmp/fuseki-assembler.ttl

The service name is in the assembler file - you can't give it again on the 
command line.

Andy


-Regards
Ajay

On Nov 4, 2015, at 5:43 PM, Andy Seaborne <[email protected]> wrote:

Change "--desc" to "--conf"

"--desc" works in the restricted case when there is one dataset description - 
but in this case there are two - the TDB dataset and the text dataset built 
over that.
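
A condensed view of why that happens, taken from the assembler file quoted further down. This is an illustration rather than new configuration, and it assumes "--desc" searches for a single resource whose type is ja:RDFDataset or a declared subclass of it, which is what the "Found two matches" error suggests:

```turtle
@prefix :     <http://mydb.com/ns/dataset#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:  <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:   <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text: <http://jena.apache.org/text#> .

# Both classes are declared as kinds of ja:RDFDataset ...
tdb:DatasetTDB   rdfs:subClassOf ja:RDFDataset .
text:TextDataset rdfs:subClassOf ja:RDFDataset .

# ... so each of these two resources is a candidate dataset root.
<#dataset> a tdb:DatasetTDB ;
   tdb:location "/tmp/fuseki-tdb" .

:text_dataset a text:TextDataset ;
   text:dataset <#dataset> ;
   text:index   <#indexLucene> .
```

With "--conf" the choice is made explicitly by the fuseki:dataset property of each service, so no such search is needed.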

Andy

On 04/11/15 12:10, Kamble, Ajay, Crest wrote:
Hi All,

1. Triplestore

I have an existing triplestore that I set up by loading data into Fuseki. I 
used Java code to put all triples into Fuseki (the URL I used is 
http://localhost:3030/mydb/data). Before loading the data, I start Fuseki with 
this command:

fuseki-server --update --loc=/tmp/fuseki-tdb /mydb
(on Mac OS X).

My database is located at /tmp/fuseki-tdb

This setup works well and I can query all triples from console.

2. Free Text Search

I need to set up free text search on top of this triplestore, so that normal 
SPARQL queries and free text queries are both possible.

Here is the assembler file that I used.

@prefix :        <http://mydb.com/ns/dataset#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix no: <http://mydb.com/ns/concepts#> .
@prefix d: <http://mydb.com/ns/data#> .

## Example of a TDB dataset and text index
## Initialize TDB
[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

## Initialize text query
[] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
# A TextDataset is a regular dataset with a text index.
text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
# Lucene index
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
# Solr index
text:TextIndexSolr    rdfs:subClassOf   text:TextIndex .

## ---------------------------------------------------------------
## This URI must be fixed - it's used to assemble the text dataset.

:text_dataset rdf:type     text:TextDataset ;
   text:dataset   <#dataset> ;
   text:index     <#indexLucene> ;
   .

# A TDB dataset used for RDF storage
<#dataset> rdf:type      tdb:DatasetTDB ;
   tdb:location "/tmp/fuseki-tdb" ;
   tdb:unionDefaultGraph true ; # Optional
   .

# Text index description
<#indexLucene> a text:TextIndexLucene ;
   text:directory <file:Lucene> ;
   ##text:directory "mem" ;
   text:entityMap <#entMap> ;
   .

# Mapping in the index
# URI stored in field "uri"
# rdfs:label is mapped to field "text"
<#entMap> a text:EntityMap ;
   text:entityField      "uri" ;
   text:defaultField     "text" ;
   text:map (
        [ text:field "text" ; text:predicate no:name ]
        [ text:field "text" ; text:predicate no:alt-name ]
        [ text:field "text" ; text:predicate no:name ]
        [ text:field "text" ; text:predicate no:title ]
        [ text:field "text" ; text:predicate no:author ]
        [ text:field "text" ; text:predicate no:inventor ]
        ) .

[] rdf:type fuseki:Server ;
  # Server-wide context parameters can be given here.
  # For example, to set query timeouts on a server-wide basis:
  # Format 1: "1000" -- 1 second timeout
  # Format 2: "10000,60000" -- 10s timeout to first result, then 60s timeout for the rest of the query.
  # See java doc for ARQ.queryTimeout
  # ja:context [ ja:cxtName "arq:queryTimeout" ;  ja:cxtValue "10000" ] ;

  # Load custom code (rarely needed)
  # ja:loadClass "your.code.Class" ;

  # Services available.  Only explicitly listed services are configured.
  #  If there is a service description not linked from this list, it is ignored.
  fuseki:services (
    <#service>
    #<#service_text_tdb>
  ) .

<#service>  rdf:type fuseki:Service ;
   fuseki:name                       "mydb" ;     # http://host:port/mydb
   fuseki:serviceQuery               "query" ;    # SPARQL query service
   fuseki:serviceQuery               "sparql" ;   # SPARQL query service
   fuseki:serviceUpdate              "update" ;   # SPARQL update service
   fuseki:serviceUpload              "upload" ;   # Non-SPARQL upload service
   fuseki:serviceReadWriteGraphStore "data" ;     # SPARQL Graph store protocol (read and write)
   fuseki:dataset                    <#dataset> ;
   fuseki:dataset                    :text_dataset ;
.

With this assembler file, I start my server with following command,

fuseki-server --update --desc=/Users/kamb16/projects/nano/data/fuseki-assembler.ttl /mydb

I get following error,

com.hp.hpl.jena.sparql.ARQException: Found two matches: var ?root -> http://mydb.com/ns/dataset#text_dataset, file:///tmp/fuseki-assembler.ttl#dataset
at com.hp.hpl.jena.sparql.util.QueryExecUtils.getOne(QueryExecUtils.java:360)
at com.hp.hpl.jena.sparql.util.graph.GraphUtils.findRootByType(GraphUtils.java:194)
at com.hp.hpl.jena.sparql.core.assembler.AssemblerUtils.build(AssemblerUtils.java:91)
at arq.cmdline.ModAssembler.create(ModAssembler.java:68)
at arq.cmdline.ModDatasetAssembler.createDataset(ModDatasetAssembler.java:43)
at org.apache.jena.fuseki.FusekiCmd.processModulesAndArgs(FusekiCmd.java:307)
at arq.cmdline.CmdArgModule.process(CmdArgModule.java:50)
at arq.cmdline.CmdMain.mainMethod(CmdMain.java:101)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:63)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:50)
at org.apache.jena.fuseki.FusekiCmd.main(FusekiCmd.java:166)

I do not understand how to fix this issue. Could you please help? I want to do 
regular Sparql queries as well as Free text search.

Regards,
Ajay
