[jira] [Updated] (JENA-1453) jena-text Lucene docs contain graph field duplicates

Osma Suominen (JIRA) Tue, 30 Jan 2018 00:21:40 -0800

     [ https://issues.apache.org/jira/browse/JENA-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Osma Suominen updated JENA-1453:
--------------------------------
    Component/s: Text

> jena-text Lucene docs contain graph field duplicates
> ----------------------------------------------------
>
>                 Key: JENA-1453
>                 URL: https://issues.apache.org/jira/browse/JENA-1453
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: Jena, Text
>    Affects Versions: Jena 3.6.0
>         Environment: All
>            Reporter: Code Ferret
>            Assignee: Code Ferret
>            Priority: Minor
>             Fix For: Jena 3.7.0
>
>
> The current jena-text integration of Lucene writes both duplicate and unused 
> fields, which increase the size of the index and reduce the performance of 
> the Lucene integration.
> Consider:
> {code}
>     ex:SomeOne
>        a       ex:Item ;
>        skos:prefLabel "Some One" ;
>        skos:prefLabel "Some Neat One"@en .
> {code}
> Assuming that:
> {code}
> [] a text:EntityMap ;
>     text:entityField      "uri" ;
>     text:uidField         "uid" ;
>     text:defaultField     "label" ;
>     text:langField        "lang" ;
>     text:graphField       "graph" ;
>     text:map (
>          [ text:field "label" ;
>            text:predicate skos:prefLabel ]
>     ) .
> {code}
> and that {{text:multilingualSupport false ;}} is set, the two Lucene 
> documents that will be indexed appear as follows:
> {code}
> Document<
>   stored,indexed,indexOptions=DOCS<uri:http://example.org/SomeOne> 
>   indexed,omitNorms,indexOptions=DOCS<graph:http://example.org/G1> 
>   stored,indexed,tokenized<label:Some One>
>   stored,indexed,omitNorms,indexOptions=DOCS 
>     <uid:e7e369a1db7ff71723fda412d1f6308e1f71dd413621f0804ab97858af51196b> 
>   stored,indexed,tokenized<graph:http://example.org/G1> 
>   stored,indexed,omitNorms,indexOptions=DOCS
>     <uid:50b49835488db84487e6e11287b570d7a9b8624fa714a9d51bf8ef444cc60bee>
> >
> Document<
>   stored,indexed,indexOptions=DOCS<uri:http://example.org/SomeOne> 
>   indexed,omitNorms,indexOptions=DOCS<graph:http://example.org/G1> 
>   stored,indexed,tokenized<label:Some Neat One> 
>   stored,indexed,omitNorms,indexOptions=DOCS<lang:en> 
>   stored,indexed,omitNorms,indexOptions=DOCS
>     <uid:2cf2b62a4a048d6517a0edddb0dabfdf190f4e074daf077b21a3844c5831376f> 
>   stored,indexed,tokenized<graph:http://example.org/G1> 
>   stored,indexed,omitNorms,indexOptions=DOCS<lang:en> 
>   stored,indexed,omitNorms,indexOptions=DOCS
>     <uid:b5dbce956b7105e9c5424620330e5ec3a9d78e8c7d73cba5a880984fa2e89bfd>
> >
> {code}
> The {{graph}} field, along with its associated {{lang}} and {{uid}} fields, 
> appears twice in each document. The first occurrence results from the 
> {{text:graphField}} configuration; the second is an artifact of 
> {{TextQueryFuncs.entityFromQuad}} adding the graph to the {{Entity}} via 
> {{entity.put(...)}}.
> This second occurrence of the graph field serves no purpose: nothing 
> searches over tokenized graph URIs, and there is currently no way to return 
> the graph field, so there is no need to store it.
> Allowing the graph field to be retrieved via the {{text:query}} property 
> function might well be a useful improvement, but that would most reasonably 
> be done by marking the initial occurrence of the graph field as stored 
> ({{Field.Store.YES}}).
> The second occurrence of the {{uid}} field is a side effect of the 
> unnecessary graph entry: the {{Entity}} to {{Document}} conversion in 
> {{TextLuceneIndex}} emits a {{uid}} for it. That {{uid}} is never used, 
> since the purpose of the {{uid}} field is to support deleting documents 
> from the Lucene index when a triple is deleted, and that does not involve 
> the graph URI.
> The solution is to delete lines 89-90 of {{TextQueryFuncs}}.
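
The mechanism described above can be illustrated with a small, self-contained sketch. It is not the jena-text code: the class `GraphFieldDuplicateSketch`, its methods, the `putGraphInMap` flag, and the placeholder uid hash are all hypothetical and chosen only for illustration, and the `lang` field is left out for brevity. It assumes, as the document dumps suggest, that the conversion emits one field plus one `uid` per entry in the entity's map, in addition to the configured graph field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the duplication; none of these names come from jena-text.
class GraphFieldDuplicateSketch {

    /** Simplified stand-in for an entity: a URI, a graph, and a map of field values. */
    public static final class Entity {
        final String uri;
        final String graph;
        final Map<String, String> fields = new LinkedHashMap<>();

        Entity(String uri, String graph) {
            this.uri = uri;
            this.graph = graph;
        }
    }

    /** Builds an entity from one quad; putGraphInMap models the extra entity.put(...) call. */
    public static Entity entityFromQuad(String graph, String subject, String label,
                                        boolean putGraphInMap) {
        Entity entity = new Entity(subject, graph);
        entity.fields.put("label", label);
        if (putGraphInMap) {
            entity.fields.put("graph", graph); // the entry the issue proposes to remove
        }
        return entity;
    }

    /** Converts an entity to a flat list of "name:value" strings standing in for index fields. */
    public static List<String> toDocumentFields(Entity entity) {
        List<String> doc = new ArrayList<>();
        doc.add("uri:" + entity.uri);
        doc.add("graph:" + entity.graph); // the configured graph field
        for (Map.Entry<String, String> field : entity.fields.entrySet()) {
            doc.add(field.getKey() + ":" + field.getValue());
            // Placeholder hash only; the real uid values in the dumps are much longer digests.
            String key = entity.uri + field.getKey() + field.getValue();
            doc.add("uid:" + Integer.toHexString(key.hashCode()));
        }
        return doc;
    }

    /** Counts how many fields in the document carry the given name. */
    public static long count(List<String> doc, String fieldName) {
        return doc.stream().filter(f -> f.startsWith(fieldName + ":")).count();
    }

    public static void main(String[] args) {
        String g = "http://example.org/G1";
        String s = "http://example.org/SomeOne";
        List<String> before = toDocumentFields(entityFromQuad(g, s, "Some One", true));
        List<String> after = toDocumentFields(entityFromQuad(g, s, "Some One", false));
        System.out.println("with extra put:    graph=" + count(before, "graph")
                + " uid=" + count(before, "uid"));
        System.out.println("without extra put: graph=" + count(after, "graph")
                + " uid=" + count(after, "uid"));
    }
}
```

In this model, dropping the extra map entry removes both the second `graph` field and the second `uid` field, which matches the effect the issue expects from deleting the two lines in {{TextQueryFuncs}}.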



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
