On Wed, Feb 25, 2015 at 6:56 AM, Chris Dollin <[email protected]> wrote:
> On 02/25/2015 11:30 AM, Andy Seaborne wrote:
>
>>>> Final call for Jena 2.13.0.
>>>
>>> Stephen wrote:
>>>
>>> I finished up and committed some outstanding changes I had for jena-text.
>>> I added the ability to specify an analyzer for the query text itself that
>>> was different than the one used for the document. I also added some
>>> documentation explaining it on the site.
>>
>> Is there a JIRA for these changes? I have only a superficial
>> understanding here but is any of this related to JENA-686?
>>
>> Stephen+Chris : maybe some discussion of plans and intentions on the dev@
>> list?
>
> Sure. I have some notes about what the 686 changes are about I can
> transcribe. I have been making the (originally small) changes for
> 686 compatible with master and have (rightly or wrongly) been delaying
> discussion until I had something that seemed to be sound.
>
> Right now I'm merging in the latest master changes and am expecting to
> make a pull request this PM.
>
> I'm guessing that it's unlikely the changes will be reviewed in time
> to make it into 2.13.0?

The query analyzer change is pretty separate from JENA-686; it just exposes a capability that Lucene already has. This is useful, for example, if you are using the StandardAnalyzer to tokenize the stored document but want to tokenize the query string differently. You could already do this with jena-text's Solr implementation, since the configuration for that is controlled via the Solr config file.

Chris's conjunctive query idea is also something I look forward to. It actually looks like I may have implemented a feature that Chris needed: the ability to specify a custom TextDocProducer.

Chris: I would be interested to see your approach for this. Are you planning on waiting until all statements have been inserted and then querying the RDF store to regenerate the documents for subjects that have been changed? How do you handle triple deletion?
I implemented the custom TextDocProducer for a slightly different reason, which was to handle triple deletions and remove the corresponding document from the Lucene index. However, my triple deletion code is kind of a hack: I am currently only indexing rdfs:label, and my application enforces a cardinality of 1 for that property, so I can just delete all documents with a given subject and predicate. The index does not actually store the value of the document, it only indexes it, so this solution would not work in the general case.

I would propose that in the future we actually store, and not just index, the document so that it can be appropriately identified and deleted. This would require a change to existing Lucene databases (we should provide a tool to reindex existing data). An alternative to storing the value would be to generate a hash of the subject+predicate+object and store that as an identifier.

Chris, I see in the JIRA that you talk about committing work to a branch, but I can't seem to locate it. Is this on GitHub somewhere?

-Stephen
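
P.S. To make the hash alternative a bit more concrete, here is a rough sketch of what I have in mind. It is not code from jena-text: the class name TripleDocId and the method of() are made up for illustration, SHA-256 is just one reasonable choice of hash, and I am assuming each term (including a literal object) has already been rendered as a string.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of jena-text: derives a stable identifier
// for one indexed triple from its subject, predicate, and object.
final class TripleDocId {
    private TripleDocId() {}

    /** Returns a 64-character hex SHA-256 digest over the three terms. */
    static String of(String subject, String predicate, String object) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            // Length-prefix each term so that ("ab", "c", o) and ("a", "bc", o)
            // cannot collapse into the same byte sequence.
            for (String term : new String[] {subject, predicate, object}) {
                byte[] bytes = term.getBytes(StandardCharsets.UTF_8);
                md.update(intToBytes(bytes.length));
                md.update(bytes);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // Big-endian encoding of an int, used as the length prefix.
    private static byte[] intToBytes(int n) {
        return new byte[] {(byte) (n >>> 24), (byte) (n >>> 16), (byte) (n >>> 8), (byte) n};
    }
}
```

The idea would be to write this value into an untokenized field when the document is added, and on triple deletion to recompute it from the deleted triple and delete the one document carrying that identifier. That avoids storing the literal itself, at the cost of still needing to reindex existing data so that old documents get the field.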
