[ https://issues.apache.org/jira/browse/JENA-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Osma Suominen updated JENA-1453:
--------------------------------
    Component/s: Text

> jena-text Lucene docs contain graph field duplicates
> ----------------------------------------------------
>
>                 Key: JENA-1453
>                 URL: https://issues.apache.org/jira/browse/JENA-1453
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: Jena, Text
>    Affects Versions: Jena 3.6.0
>         Environment: All
>            Reporter: Code Ferret
>            Assignee: Code Ferret
>            Priority: Minor
>             Fix For: Jena 3.7.0
>
>
> The current jena-text integration of Lucene has both duplicate and unused
> fields that increase the required space and reduce the performance of the
> Lucene integration.
> Consider:
> {code}
> ex:SomeOne
>     a ex:Item ;
>     skos:prefLabel "Some One" ;
>     skos:prefLabel "Some Neat One"@en ;
>     .
> {code}
> Assuming that:
> {code}
> [] a text:EntityMap ;
>     text:entityField  "uri" ;
>     text:uidField     "uid" ;
>     text:defaultField "label" ;
>     text:langField    "lang" ;
>     text:graphField   "graph" ;
>     text:map (
>         [ text:field "label" ;
>           text:predicate skos:prefLabel ]
>     ) .
> {code}
> and that {{text:multilingualSupport false ;}}, then the two Lucene documents
> that will be indexed appear as follows:
> {code}
> Document<
>     stored,indexed,indexOptions=DOCS<uri:http://example.org/SomeOne>
>     indexed,omitNorms,indexOptions=DOCS<graph:http://example.org/G1>
>     stored,indexed,tokenized<label:Some One>
>     stored,indexed,omitNorms,indexOptions=DOCS<uid:e7e369a1db7ff71723fda412d1f6308e1f71dd413621f0804ab97858af51196b>
>     stored,indexed,tokenized<graph:http://example.org/G1>
>     stored,indexed,omitNorms,indexOptions=DOCS<uid:50b49835488db84487e6e11287b570d7a9b8624fa714a9d51bf8ef444cc60bee>
> >
> Document<
>     stored,indexed,indexOptions=DOCS<uri:http://example.org/SomeOne>
>     indexed,omitNorms,indexOptions=DOCS<graph:http://example.org/G1>
>     stored,indexed,tokenized<label:Some Neat One>
>     stored,indexed,omitNorms,indexOptions=DOCS<lang:en>
>     stored,indexed,omitNorms,indexOptions=DOCS<uid:2cf2b62a4a048d6517a0edddb0dabfdf190f4e074daf077b21a3844c5831376f>
>     stored,indexed,tokenized<graph:http://example.org/G1>
>     stored,indexed,omitNorms,indexOptions=DOCS<lang:en>
>     stored,indexed,omitNorms,indexOptions=DOCS<uid:b5dbce956b7105e9c5424620330e5ec3a9d78e8c7d73cba5a880984fa2e89bfd>
> >
> {code}
> The {{graph}} field (and the associated {{lang}} and {{uid}} fields) appears
> twice in each document. The initial occurrence results from the
> {{text:graphField}} configuration and the second is an artifact of
> {{TextQueryFuncs.entityFromQuad}} adding the graph to the {{Entity}} via
> {{entity.put(...)}}.
> This second occurrence of the graph field serves no purpose: there is no
> search over tokenized graph URIs, and there is currently no way to return the
> graph field, so there is no need to store it.
> It might well be a useful improvement to allow the graph field to be
> retrieved via the {{text:query}} property function, but that would most
> reasonably be done by adding {{Field.Store.YES}} to the {{FieldType}} for the
> initial occurrence of the graph field.
> The second occurrence of the {{uid}} field results from the unnecessary
> graph occurrence during the {{Entity}} to {{Document}} conversion in
> {{TextLuceneIndex}}. It is never used, since the purpose of the {{uid}}
> field is to handle deleting documents from the Lucene index when a triple is
> deleted, which does not involve the graph URI.
> The solution is to delete lines 89-90 of {{TextQueryFuncs}}.



--
This message was sent by Atlassian JIRA (v7.6.3#76005)
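
The duplication described in this issue can be illustrated with a small, self-contained toy model. This is only a sketch under stated assumptions: `GraphFieldSketch`, `entityFromQuad`, `toDocument`, and `count` are hypothetical stand-ins written for illustration and are not the actual Jena or Lucene code, and the `uid` value is a placeholder rather than the real hash. It assumes that the conversion emits one field plus one `uid` per entry in the entity's map, as the description above suggests.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical stand-ins, not Jena's TextQueryFuncs or TextLuceneIndex.
final class GraphFieldSketch {

    // Builds the entity's property map. When putGraphInMap is true, the graph URI
    // is also stored as an ordinary property, mimicking the extra entity.put(...) call.
    static Map<String, String> entityFromQuad(String graph, String label, boolean putGraphInMap) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("label", label);
        if (putGraphInMap) {
            props.put("graph", graph);
        }
        return props;
    }

    // Flattens an entity into "name:value" fields, keeping duplicates visible.
    static List<String> toDocument(String uri, String graph, Map<String, String> props) {
        List<String> doc = new ArrayList<>();
        doc.add("uri:" + uri);
        doc.add("graph:" + graph); // occurrence driven by the graph field configuration
        for (Map.Entry<String, String> e : props.entrySet()) {
            doc.add(e.getKey() + ":" + e.getValue());
            // Placeholder uid (assumption): one per map entry, not the real hash.
            String seed = uri + "|" + e.getKey() + "|" + e.getValue();
            doc.add("uid:" + Integer.toHexString(seed.hashCode()));
        }
        return doc;
    }

    // Counts how many fields in the document carry the given name.
    static long count(List<String> doc, String name) {
        return doc.stream().filter(f -> f.startsWith(name + ":")).count();
    }

    public static void main(String[] args) {
        String uri = "http://example.org/SomeOne";
        String graph = "http://example.org/G1";
        List<String> before = toDocument(uri, graph, entityFromQuad(graph, "Some One", true));
        List<String> after = toDocument(uri, graph, entityFromQuad(graph, "Some One", false));
        System.out.println("graph fields with the extra put:    " + count(before, "graph"));
        System.out.println("graph fields without the extra put: " + count(after, "graph"));
    }
}
```

In this model, dropping the extra put removes both the second `graph` field and the `uid` that accompanies it, which is the effect the issue attributes to deleting the two lines in `TextQueryFuncs`.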