GitHub user osma commented on the pull request:

https://github.com/apache/jena/pull/53#issuecomment-97699765

Hi Alexis!

Thanks for the code! This just screams for unit tests that show that it actually works in all cases...

I'm a bit worried about the BORDER_DELIMITER trick used here. It seems very fragile. For example, what happens if I have these two triples indexed using jena-text:

    ex:paris rdfs:label "Paris"@fr .
    ex:paris rdfs:label "Paris"@en .

Then the second triple is deleted. Will both entries be dropped from the Lucene index?

Similar things may happen for variants that differ only in letter case (which most analyzers fold to lowercase), or for singular/plural forms that get stemmed into the same base form. These may all cause false matches when entries are about to be deleted.

If I had to implement something like this, I'd instead try to store the original literal in the Lucene index, as intact as possible. In practice, that would mean putting the string value and language tag in separate fields that store the values without tokenizing (as the uri and graph fields currently do). Then at deletion time it would be easy to query for the exact same entry and remove it, and only it, from the Lucene index.
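To make the concern concrete, here is a minimal sketch in plain Java. It is a toy in-memory model, not the Lucene or jena-text API: the class `ExactDeleteSketch`, its `Entry` type, and the `analyze` method (which only lowercases, as a crude stand-in for a real analyzer) are hypothetical names made up for illustration. It contrasts deleting by analyzed text with deleting by the verbatim literal and language tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy in-memory stand-in for a text index (hypothetical; not Lucene or jena-text). */
final class ExactDeleteSketch {

    /** One indexed entry; the literal and language tag are kept verbatim. */
    private static final class Entry {
        final String uri;
        final String lexical;
        final String lang;

        Entry(String uri, String lexical, String lang) {
            this.uri = uri;
            this.lexical = lexical;
            this.lang = lang;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void add(String uri, String lexical, String lang) {
        entries.add(new Entry(uri, lexical, lang));
    }

    int size() {
        return entries.size();
    }

    /** Assumption: a real analyzer may also stem; this one only folds case. */
    static String analyze(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    /** Deletes by analyzed text only, so the language tag and case are ignored. */
    int deleteByAnalyzedText(String uri, String lexical) {
        int before = entries.size();
        entries.removeIf(e -> e.uri.equals(uri)
                && analyze(e.lexical).equals(analyze(lexical)));
        return before - entries.size();
    }

    /** Deletes only the entry whose untokenized values all match exactly. */
    int deleteExact(String uri, String lexical, String lang) {
        int before = entries.size();
        entries.removeIf(e -> e.uri.equals(uri)
                && e.lexical.equals(lexical)
                && e.lang.equals(lang));
        return before - entries.size();
    }

    /** Builds an index holding the two example triples from the comment. */
    static ExactDeleteSketch sample() {
        ExactDeleteSketch index = new ExactDeleteSketch();
        index.add("ex:paris", "Paris", "fr");
        index.add("ex:paris", "Paris", "en");
        return index;
    }

    public static void main(String[] args) {
        ExactDeleteSketch fuzzy = sample();
        System.out.println("analyzed delete removed: "
                + fuzzy.deleteByAnalyzedText("ex:paris", "Paris"));

        ExactDeleteSketch exact = sample();
        System.out.println("exact delete removed: "
                + exact.deleteExact("ex:paris", "Paris", "en"));
    }
}
```

In this toy model, the analyzed-text deletion removes both labels, while the exact deletion removes only the `@en` entry and leaves the `@fr` one in place. Whether the code in the pull request behaves like the first case is exactly what unit tests would need to show.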