Frank Budinsky wrote:
> I'm trying to get a handle on the strategic implications of using Jena
> property functions, and specifically the LARQ textMatch property function
> approach for supporting full text search.

Hi Frank,

I have shared my comments on free text and SPARQL in my other post. Here, I'd like to understand better whether you have any specific concerns about LARQ.

By the way, LARQ has not been released in Apache yet. It is ready to be released, and I am trying to keep it aligned with the latest Apache Lucene releases. (Note to myself: check whether a new Apache Lucene release is available and, if so, update LARQ's dependency and test.) We were thinking of releasing Fuseki first and LARQ soon after.

In terms of functionality, LARQ did not use to support "deletes"; now it does (even if there might be better ways). Let me explain. In the past, if you added something to your Jena Model, it was indexed by LARQ in the Lucene index. But if you removed something from your Jena Model, LARQ did not delete it from the Lucene index. In a scenario where you have mostly reads and updates are infrequent, that is fine: people can rebuild the free text index nightly or every few hours. But ideally, you want the free text index to be kept up to date with the RDF storage. This is what I am trying to achieve with LARQ. Ideally, a user just needs to say "use LARQ" and not worry about building indexes as data is added to or removed from the RDF storage system.

One thing I am struggling with is capturing all the possible paths through which RDF data can change: APIs, SPARQL Update, bulk load, others? For SPARQL Update, for example, we still have an open issue: https://issues.apache.org/jira/browse/JENA-164

In addition to JENA-164, and once LARQ 1.0 is released in Apache, I'd like to think about how to refactor LARQ's code to make it easier for people to plug in different free text indexing systems such as, just to name a couple, Apache Solr and/or ElasticSearch (http://www.elasticsearch.org/).
In relation to this, I have a couple of (now old) prototypes and proofs of concept here: https://github.com/castagna/SARQ and here: https://github.com/castagna/EARQ

I use GitHub for experimental stuff and/or early prototypes, but if there is need and demand, I do not see why those things could not be moved into Apache Jena. Indeed, it is my intention to do so. See: https://issues.apache.org/jira/browse/JENA-17 (you can add comments there, watch the issue and/or vote for it). A vote from someone who is not a committer tells me that someone else is interested in that feature and would like to see it implemented.

The big advantage of using Solr and/or ElasticSearch is that the indexes used to answer free text queries can be remote, so they do not need to share RAM with the RDF storage layer. Solr and ElasticSearch also provide a replicated/distributed solution almost out of the box.

Another thing to consider is the query syntax for free text search queries. Fortunately, Solr and ElasticSearch both use Lucene, and there are not many differences in the basic syntax. LARQ passes the query string through to Lucene, so you have the full power of Lucene (sort of). Things start to get complicated once you throw in literals in multiple languages and/or want to support different analyzers. Maybe this is another area where LARQ could be improved, if needed.

In general, from Jena, I'd like to enable other developers to be notified, if they need to be, as the underlying RDF data changes (i.e. triples/quads added or removed), and to make that as easy as possible. I am unclear on the best way to achieve that. It is not that simple, because there are multiple paths and we do not want the handling of notifications to affect performance too much. This could obviously be used by LARQ, but also by other custom indexes (e.g. GeoSPARQL).
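
To make the notification idea concrete, here is a minimal sketch in plain Java. It is not Jena or LARQ code: `Triple`, `ChangeListener`, `TripleStore` and `TextIndex` are hypothetical names invented for this illustration, and the toy inverted index only stands in for Lucene, Solr or ElasticSearch. It assumes every path that changes the data funnels through the store's `add` and `remove`, which is exactly the part that is hard to guarantee in practice.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical types for illustration only; none of these are Jena or LARQ classes.

/** A triple with a literal object, reduced to three strings. */
final class Triple {
    final String subject, predicate, literal;
    Triple(String subject, String predicate, String literal) {
        this.subject = subject; this.predicate = predicate; this.literal = literal;
    }
    @Override public boolean equals(Object o) {
        if (!(o instanceof Triple)) return false;
        Triple t = (Triple) o;
        return subject.equals(t.subject) && predicate.equals(t.predicate) && literal.equals(t.literal);
    }
    @Override public int hashCode() { return Objects.hash(subject, predicate, literal); }
}

/** Callback invoked whenever the store changes. */
interface ChangeListener {
    void added(Triple t);
    void removed(Triple t);
}

/** Toy store: add/remove notify listeners only when the data really changed. */
class TripleStore {
    private final Set<Triple> triples = new HashSet<>();
    private final List<ChangeListener> listeners = new ArrayList<>();

    void register(ChangeListener l) { listeners.add(l); }

    void add(Triple t) {
        if (triples.add(t)) for (ChangeListener l : listeners) l.added(t);
    }

    void remove(Triple t) {
        if (triples.remove(t)) for (ChangeListener l : listeners) l.removed(t);
    }
}

/** Toy inverted index that stays in sync by listening to the store, deletes included. */
class TextIndex implements ChangeListener {
    private final Map<String, Set<Triple>> postings = new HashMap<>();

    private static String[] tokens(String text) {
        return text.toLowerCase(Locale.ROOT).split("\\W+");
    }

    @Override public void added(Triple t) {
        for (String tok : tokens(t.literal)) {
            if (!tok.isEmpty()) postings.computeIfAbsent(tok, k -> new HashSet<>()).add(t);
        }
    }

    @Override public void removed(Triple t) {
        for (String tok : tokens(t.literal)) {
            Set<Triple> hits = postings.get(tok);
            if (hits != null) {
                hits.remove(t);
                if (hits.isEmpty()) postings.remove(tok);
            }
        }
    }

    /** Returns a copy of the triples whose literal contains the given single term. */
    Set<Triple> search(String term) {
        return new HashSet<>(postings.getOrDefault(term.toLowerCase(Locale.ROOT), Set.of()));
    }
}

class ChangeNotificationSketch {
    public static void main(String[] args) {
        TripleStore store = new TripleStore();
        TextIndex index = new TextIndex();
        store.register(index);

        Triple t = new Triple("ex:doc1", "dc:title", "Full text search with Lucene");
        store.add(t);
        System.out.println(index.search("lucene").size()); // prints 1

        store.remove(t);
        System.out.println(index.search("lucene").size()); // prints 0
    }
}
```

A different backend, or a replication job, would simply be another `ChangeListener` registered on the same store. The sketch calls listeners synchronously, so a slow listener slows every write; that is the performance concern mentioned above.
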
Others might use change notifications to replicate data to remote machines or different systems.

Thinking about all this, I am also not completely sure whether it would be better to do this outside Apache Jena. The big advantage of doing it in Apache Jena is that we can provide users with something out of the box and they do not need to worry about it. We do it for them: if they need to be notified of changes, they listen to the events they are interested in. Easy for them.

In conclusion, Frank, share with us your concerns, needs and, if you have any, feature requests. I hope it is now clearer how I see things in relation to LARQ. Be careful, these are just my opinions. In Apache, as in many open source projects, if you want or need something, you need to be prepared to take action and commit time, effort, etc. to it.

Apache Jena and Apache Lucene are two brilliant and useful pieces of software: LARQ joins them together in a simple and useful way. I like it.

Paolo

PS: There is another thing in relation to LARQ and/or free text that I've been thinking about. It relates to query optimisation and deciding whether or not to query the free text index first. In practice, Lucene, Solr and ElasticSearch are amazingly fast and continue to improve: hitting them first is my advice. :-)
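
As a toy illustration of the "hit the index first" advice in the PS, the sketch below compares two evaluation orders on in-memory data. It is neither ARQ's optimiser nor LARQ's API: `IndexFirstSketch`, `Row`, `storeFirst` and `indexFirst` are hypothetical names, the maps stand in for a real text index, and the `touched` counter is only a rough proxy for cost.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical names throughout; this is neither ARQ's optimiser nor LARQ's API.
class IndexFirstSketch {

    /** A (subject, literal) pair standing in for a triple with a literal object. */
    static final class Row {
        final String subject;
        final String literal;
        Row(String subject, String literal) { this.subject = subject; this.literal = literal; }
    }

    /** Store-first: scan every row and test its literal, so all rows are touched. */
    static Set<String> storeFirst(List<Row> rows, String term, int[] touched) {
        Set<String> subjects = new TreeSet<>();
        for (Row r : rows) {
            touched[0]++;
            if (r.literal.toLowerCase(Locale.ROOT).contains(term)) subjects.add(r.subject);
        }
        return subjects;
    }

    /** Index-first: fetch matching literals from the text index, then visit only their rows. */
    static Set<String> indexFirst(Map<String, Set<String>> termToLiterals,
                                  Map<String, List<Row>> rowsByLiteral,
                                  String term, int[] touched) {
        Set<String> subjects = new TreeSet<>();
        for (String literal : termToLiterals.getOrDefault(term, Set.of())) {
            for (Row r : rowsByLiteral.getOrDefault(literal, List.of())) {
                touched[0]++;
                subjects.add(r.subject);
            }
        }
        return subjects;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
            new Row("ex:a", "lucene in action"),
            new Row("ex:b", "semantic web primer"),
            new Row("ex:c", "sparql by example"));
        // Pre-built lookup tables playing the role of the free text index.
        Map<String, Set<String>> termToLiterals = Map.of("lucene", Set.of("lucene in action"));
        Map<String, List<Row>> rowsByLiteral = Map.of("lucene in action", List.of(rows.get(0)));

        int[] scanCost = {0};
        int[] indexCost = {0};
        System.out.println(storeFirst(rows, "lucene", scanCost) + " after touching " + scanCost[0] + " rows");
        System.out.println(indexFirst(termToLiterals, rowsByLiteral, "lucene", indexCost)
            + " after touching " + indexCost[0] + " rows");
    }
}
```

Both orders return the same subjects; the index-first order touches only the rows whose literal matched. With a selective search term that gap grows with the size of the store, which is the intuition behind querying the free text index first.
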
