GitHub user osma commented on the pull request:

    https://github.com/apache/jena/pull/53#issuecomment-97699765
  
    Hi Alexis! Thanks for the code!
    
    This just screams for unit tests that show that it actually works in all 
cases...
    
    I'm a bit worried about the BORDER_DELIMITER trick used here. It seems very 
fragile. For example, what happens if I have these two triples indexed using 
jena-text:
    
    ex:paris rdfs:label "Paris"@fr .
    ex:paris rdfs:label "Paris"@en .
    
    Then the second triple is deleted. Will both entries be dropped from the 
Lucene index?
    
    Similar things may happen for variants that differ only in letter case 
(which will be folded to lowercase by most analyzers), or singular/plural forms 
that get stemmed into the same base form. These may all cause false matches 
when entries are about to be deleted.
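
    To make the concern concrete, here is a toy model in plain Java. It is not the code from this pull request and does not use Lucene: `analyze` is a hypothetical stand-in for an analyzer (lowercasing plus a crude plural strip), and the index is just a list of `{subject, literal, languageTag}` arrays. It only shows how a delete keyed on the analyzed text can hit more entries than intended.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class AnalyzedDeleteDemo {
    // Hypothetical stand-in for an analyzer: lowercase, then strip one trailing "s".
    static String analyze(String text) {
        String t = text.toLowerCase(Locale.ROOT);
        return (t.length() > 1 && t.endsWith("s")) ? t.substring(0, t.length() - 1) : t;
    }

    // Each entry is {subject, literal, languageTag}.
    static List<String[]> sampleIndex() {
        List<String[]> index = new ArrayList<>();
        index.add(new String[] {"ex:paris", "Paris", "fr"});
        index.add(new String[] {"ex:paris", "Paris", "en"});
        index.add(new String[] {"ex:cat", "Cat", "en"});
        index.add(new String[] {"ex:cat", "cats", "en"});
        return index;
    }

    // Removes every entry whose subject and analyzed literal match;
    // the language tag is ignored. Returns the number of entries removed.
    static int deleteByAnalyzedText(List<String[]> index, String subject, String literal) {
        String key = analyze(literal);
        int before = index.size();
        index.removeIf(e -> e[0].equals(subject) && analyze(e[1]).equals(key));
        return before - index.size();
    }

    public static void main(String[] args) {
        List<String[]> index = sampleIndex();
        // Intended to delete only "Paris"@en, but "Paris"@fr goes too.
        System.out.println(deleteByAnalyzedText(index, "ex:paris", "Paris"));
        // Intended to delete only "cats"@en, but "Cat"@en goes too.
        System.out.println(deleteByAnalyzedText(index, "ex:cat", "cats"));
    }
}
```

    In this model each call removes two entries although only one triple was deleted.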
    
    If I had to implement something like this, I'd instead try to store the 
original literal in the Lucene index, as intact as possible. In practice, that 
means putting the string value and language tag in separate fields that store 
the values without tokenizing (as the uri and graph fields currently do). Then 
at deletion time it would be easy to query for exactly that entry and remove it, 
and only it, from the Lucene index.
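
    A rough sketch of that idea, again as a plain-Java toy model rather than actual jena-text or Lucene code. The `Entry` class and `deleteExact` method are hypothetical names for illustration; the point is only that the literal and language tag are kept verbatim and compared exactly at deletion time.

```java
import java.util.ArrayList;
import java.util.List;

final class ExactDeleteDemo {
    // Hypothetical index entry: every field is kept verbatim (untokenized).
    static final class Entry {
        final String uri;
        final String literal;
        final String lang;

        Entry(String uri, String literal, String lang) {
            this.uri = uri;
            this.literal = literal;
            this.lang = lang;
        }
    }

    static List<Entry> sampleIndex() {
        List<Entry> index = new ArrayList<>();
        index.add(new Entry("ex:paris", "Paris", "fr"));
        index.add(new Entry("ex:paris", "Paris", "en"));
        index.add(new Entry("ex:cat", "Cat", "en"));
        index.add(new Entry("ex:cat", "cats", "en"));
        return index;
    }

    // Removes only entries whose uri, literal and language tag all match exactly.
    // Returns the number of entries removed.
    static int deleteExact(List<Entry> index, String uri, String literal, String lang) {
        int before = index.size();
        index.removeIf(e -> e.uri.equals(uri)
                && e.literal.equals(literal)
                && e.lang.equals(lang));
        return before - index.size();
    }

    public static void main(String[] args) {
        List<Entry> index = sampleIndex();
        // Only "Paris"@en is removed; "Paris"@fr stays in the index.
        System.out.println(deleteExact(index, "ex:paris", "Paris", "en"));
        System.out.println(index.size());
    }
}
```

    In Lucene terms I would expect this to correspond to untokenized fields (for example `StringField`) and a delete query that requires an exact term match on each of them, but that mapping is an assumption I have not tested against this patch.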

