[ https://issues.apache.org/jira/browse/JENA-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321981#comment-16321981 ]
ASF subversion and git services commented on JENA-1453:
-------------------------------------------------------
Commit 1820850 from [~andy.seaborne] in branch 'site/trunk'
[ https://svn.apache.org/r1820850 ]
JENA-1453 Updates
> jena-text Lucene docs contain graph field duplicates
> ----------------------------------------------------
>
> Key: JENA-1453
> URL: https://issues.apache.org/jira/browse/JENA-1453
> Project: Apache Jena
> Issue Type: Improvement
> Components: Jena
> Affects Versions: Jena 3.6.0
> Environment: All
> Reporter: Code Ferret
> Assignee: Code Ferret
> Priority: Minor
> Fix For: Jena 3.7.0
>
>
> The current jena-text integration of Lucene has both duplicate and unused
> fields that increase the required space and reduce the performance of the
> Lucene integration.
> Consider:
> {code}
> ex:SomeOne
>     a ex:Item ;
>     skos:prefLabel "Some One" ;
>     skos:prefLabel "Some Neat One"@en .
> {code}
> Assuming that:
> {code}
> [] a text:EntityMap ;
>     text:entityField "uri" ;
>     text:uidField "uid" ;
>     text:defaultField "label" ;
>     text:langField "lang" ;
>     text:graphField "graph" ;
>     text:map (
>         [ text:field "label" ;
>           text:predicate skos:prefLabel ]
>     ) .
> {code}
> and that {{text:multilingualSupport false ;}} is set, the two Lucene
> documents that will be indexed appear as follows:
> {code}
> Document<
> stored,indexed,indexOptions=DOCS<uri:http://example.org/SomeOne>
> indexed,omitNorms,indexOptions=DOCS<graph:http://example.org/G1>
> stored,indexed,tokenized<label:Some One>
> stored,indexed,omitNorms,indexOptions=DOCS
> <uid:e7e369a1db7ff71723fda412d1f6308e1f71dd413621f0804ab97858af51196b>
> stored,indexed,tokenized<graph:http://example.org/G1>
> stored,indexed,omitNorms,indexOptions=DOCS
> <uid:50b49835488db84487e6e11287b570d7a9b8624fa714a9d51bf8ef444cc60bee>
> >
> Document<
> stored,indexed,indexOptions=DOCS<uri:http://example.org/SomeOne>
> indexed,omitNorms,indexOptions=DOCS<graph:http://example.org/G1>
> stored,indexed,tokenized<label:Some Neat One>
> stored,indexed,omitNorms,indexOptions=DOCS<lang:en>
> stored,indexed,omitNorms,indexOptions=DOCS
> <uid:2cf2b62a4a048d6517a0edddb0dabfdf190f4e074daf077b21a3844c5831376f>
> stored,indexed,tokenized<graph:http://example.org/G1>
> stored,indexed,omitNorms,indexOptions=DOCS<lang:en>
> stored,indexed,omitNorms,indexOptions=DOCS
> <uid:b5dbce956b7105e9c5424620330e5ec3a9d78e8c7d73cba5a880984fa2e89bfd>
> >
> {code}
> The {{graph}} field (and the associated {{lang}} and {{uid}} fields)
> appears twice in each document. The first occurrence results from the
> {{text:graphField}} configuration; the second is an artifact of
> {{TextQueryFuncs.entityFromQuad}} adding the graph to the {{Entity}} via
> {{entity.put(...)}}.
> This second occurrence of the graph field serves no purpose: there is no
> search over tokenized graph URIs, and since there is currently no way to
> return the graph field, there is no need to store it.
> It might well be a useful improvement to allow the graph field to be
> retrieved via the {{text:query}} property function, but that would most
> reasonably be done by adding {{Field.Store.YES}} to the {{FieldType}} for
> the first occurrence of the graph field.
> The second {{uid}} field is a side effect of the unnecessary graph
> occurrence, produced during the {{Entity}} to {{Document}} conversion in
> {{TextLuceneIndex}}. It is never used, since the purpose of the {{uid}}
> field is to support deleting documents from the Lucene index when a triple
> is deleted, which does not involve the graph URI.
> The solution is to delete lines 89-90 of {{TextQueryFuncs}}.
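
The duplication described above can be modelled with a minimal sketch in plain Java. It does not reproduce the actual jena-text or Lucene code: the {{Entity}} class, the {{entityFromQuad}} and {{toDocument}} signatures, the {{putGraphInMap}} flag, and the placeholder uid value are all hypothetical stand-ins, assumed here only to show how one extra {{put}} of the graph can yield a second {{graph}} field and a second {{uid}} field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GraphFieldDuplicateSketch {

    // Hypothetical stand-in for an entity: an id, a graph, and a map of field values.
    static final class Entity {
        final String id;
        final String graph;
        final Map<String, String> fields = new LinkedHashMap<>();

        Entity(String id, String graph) {
            this.id = id;
            this.graph = graph;
        }

        void put(String key, String value) {
            fields.put(key, value);
        }
    }

    // Hypothetical builder. When putGraphInMap is true the graph is also stored
    // as an ordinary map entry, which is the behaviour the issue proposes to remove.
    static Entity entityFromQuad(String graph, String subject, String field,
                                 String value, boolean putGraphInMap) {
        Entity e = new Entity(subject, graph);
        e.put(field, value);
        if (putGraphInMap) {
            e.put("graph", graph);
        }
        return e;
    }

    // Hypothetical conversion: one uri field, one graph field from the
    // configuration, then one value field plus one uid field per map entry.
    static List<String> toDocument(Entity e) {
        List<String> doc = new ArrayList<>();
        doc.add("uri:" + e.id);
        doc.add("graph:" + e.graph);
        for (Map.Entry<String, String> f : e.fields.entrySet()) {
            doc.add(f.getKey() + ":" + f.getValue());
            // Placeholder value only; this is not the hash jena-text computes.
            String uid = Integer.toHexString((e.id + f.getKey() + f.getValue()).hashCode());
            doc.add("uid:" + uid);
        }
        return doc;
    }

    // Counts the fields in the document that carry the given name.
    static int countFields(List<String> doc, String name) {
        int n = 0;
        for (String f : doc) {
            if (f.startsWith(name + ":")) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String g = "http://example.org/G1";
        String s = "http://example.org/SomeOne";
        for (boolean putGraph : new boolean[] { true, false }) {
            List<String> doc = toDocument(entityFromQuad(g, s, "label", "Some One", putGraph));
            System.out.println("putGraphInMap=" + putGraph
                    + " graph fields=" + countFields(doc, "graph")
                    + " uid fields=" + countFields(doc, "uid"));
        }
    }
}
```

In this model, leaving the graph out of the map is enough to drop both the extra {{graph}} field and the extra {{uid}} field, which is the effect the proposed deletion is meant to have.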