Re: Standardizing property functions and/or full text search in SPARQL

Paolo Castagna Sat, 10 Mar 2012 01:13:05 -0800

Frank Budinsky wrote:
> I'm trying to get a handle on the strategic implications of using Jena
> property functions, and specifically the LARQ textMatch property function
> approach for supporting full text search.


Hi Frank,
I have shared my comments on free text and SPARQL in my other post.

Here, I'd like to better understand if you have any specific concern in 
relation to LARQ.

By the way, LARQ has not been released yet in Apache. It's ready to be released
and I am trying to keep it aligned with the latest Apache Lucene releases. (Note
to myself: go and check whether a new Apache Lucene release is available; if so,
update LARQ's dependency and test.)
We were thinking of releasing Fuseki first and LARQ soon after.

In terms of functionality, LARQ used not to support "deletes"; now it does
(even if there might be better ways).
Let me explain. In the past, if you added something to your Jena Model, it was
added/indexed by LARQ in the Lucene indexes. But if you removed something from
your Jena Model, LARQ did not delete it from the Lucene indexes. In a scenario
where you have mostly reads and updates are infrequent, that is fine: people
can rebuild the free text index nightly or every few hours. But ideally you
want the free text index to be kept up to date with the RDF storage. This is
what I am trying to achieve with LARQ. Ideally, a user just needs to say "use
LARQ" and need not worry about building indexes as data is added to or removed
from the RDF storage system.
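
A minimal sketch of what "keeping the index up to date on deletes" involves,
written in plain Python rather than against the Jena or Lucene APIs. The class
and method names are hypothetical, and the assumption that the index is keyed
by literal (so a literal shared by several triples must only be dropped when
the last one goes) is made for illustration only:

```python
from collections import defaultdict


class SyncedTextIndex:
    """Toy triple store plus inverted index (hypothetical, not LARQ code)."""

    def __init__(self):
        self.triples = set()
        self.postings = defaultdict(set)   # token -> literals containing it
        self.refcount = defaultdict(int)   # literal -> number of triples using it

    def add(self, s, p, literal):
        triple = (s, p, literal)
        if triple in self.triples:
            return
        self.triples.add(triple)
        self.refcount[literal] += 1
        if self.refcount[literal] == 1:    # first use: index the literal
            for token in literal.lower().split():
                self.postings[token].add(literal)

    def remove(self, s, p, literal):
        triple = (s, p, literal)
        if triple not in self.triples:
            return
        self.triples.remove(triple)
        self.refcount[literal] -= 1
        if self.refcount[literal] == 0:    # last use gone: unindex the literal
            del self.refcount[literal]
            for token in literal.lower().split():
                self.postings[token].discard(literal)
                if not self.postings[token]:
                    del self.postings[token]

    def search(self, token):
        return set(self.postings.get(token.lower(), set()))
```

The reference count is the part that an "add only" index does not need: two
resources can carry the same label, and removing one of them must not make the
other unfindable.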

One thing I am struggling with is trying to capture all the possible paths
through which RDF data can change: APIs, SPARQL Update, bulk load, others?
In relation to SPARQL Update, for example, we have an issue still open:
https://issues.apache.org/jira/browse/JENA-164

In addition to JENA-164, and once LARQ 1.0 is released in Apache, I'd like to
think about how to refactor LARQ's code to make it easier for people to plug in
different free text indexing systems such as, to name just a couple, Apache
Solr and/or ElasticSearch (http://www.elasticsearch.org/). In relation to this,
I have a couple of (now old) prototypes and proofs of concept, here:
https://github.com/castagna/SARQ and here: https://github.com/castagna/EARQ
I use GitHub for experimental stuff and/or early prototypes... but if there is
need and demand, I do not see why those things cannot be moved into Apache
Jena. And, indeed, it is my intention to do so. See:
https://issues.apache.org/jira/browse/JENA-17 (you can add comments there,
watch an issue and/or vote for it). A vote, when it comes from someone who is
not a committer, tells me someone else is interested in that feature and would
like to see it implemented.
The big advantage of using Solr and/or ElasticSearch is that the indexes that
answer free text queries can be remote, and therefore do not need to share RAM
with the RDF storage layer. Solr and ElasticSearch also provide a
replicated/distributed solution, almost out of the box.
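
One way to picture the refactoring described above is a small backend
interface that the indexing code talks to, with one implementation per free
text system. The sketch below is plain Python with hypothetical names
(TextIndex, InMemoryIndex, RecordingRemoteIndex); it does not reflect the
actual LARQ, SARQ or EARQ code, and the "remote" backend only records the
requests it would send instead of talking to a real server:

```python
from abc import ABC, abstractmethod


class TextIndex(ABC):
    """Hypothetical backend interface; names are illustrative only."""

    @abstractmethod
    def index(self, node, text): ...

    @abstractmethod
    def unindex(self, node, text): ...

    @abstractmethod
    def search(self, query): ...


class InMemoryIndex(TextIndex):
    """Stand-in for a local, embedded index."""

    def __init__(self):
        self.docs = set()

    def index(self, node, text):
        self.docs.add((node, text))

    def unindex(self, node, text):
        self.docs.discard((node, text))

    def search(self, query):
        q = query.lower()
        return sorted(node for node, text in self.docs if q in text.lower())


class RecordingRemoteIndex(TextIndex):
    """Stand-in for a remote index: records the requests it would send."""

    def __init__(self):
        self.requests = []

    def index(self, node, text):
        self.requests.append(("index", node, text))

    def unindex(self, node, text):
        self.requests.append(("delete", node, text))

    def search(self, query):
        self.requests.append(("search", query))
        return []  # a real backend would return the server's hits


def load(backend, pairs):
    """Indexing code depends only on the TextIndex interface."""
    for node, text in pairs:
        backend.index(node, text)
```

With this shape, swapping the backend is a one-line change for the caller, and
a remote backend keeps its memory footprint out of the RDF store's process.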

Another thing to consider is the query syntax for free text search queries.
Fortunately, Solr and ElasticSearch both use Lucene and there are not many
differences in the basic syntax.
LARQ passes the query string through to Lucene, so you have the full power of
Lucene (sort of). Things start to get complicated once you throw in literals in
multiple languages and/or you want to support different analyzers. Maybe this
is another area where LARQ could be improved, if needed.

In general, from Jena, I'd like to enable other developers to be notified (and
make it as easy as possible), if they need to be, as the underlying RDF data
changes (i.e. triples/quads added or removed). I am unclear on the best way to
achieve that; it's not that simple, because there are multiple paths and we
also do not want the handling of notifications to affect performance too much.
This could obviously be used by LARQ, but also by other custom indexes (e.g.
GeoSPARQL, etc.). Others might use notifications as the data changes to
replicate data on remote machines or different systems. Thinking about all
this, I am also not completely sure whether it would be better to do this
outside Apache Jena. The big advantage of us doing it in Apache Jena is that we
can provide users with something out of the box and they do not need to worry
about it. We do it for them: if they need to be notified of changes, they
listen to the events they are interested in.
Easy for them.
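
A minimal sketch of the notification idea, in plain Python and not based on any
existing Jena API. NotifyingStore and its methods are hypothetical names. The
point it illustrates is that every path that changes data (here a direct API
call and a bulk load) funnels through one place that fires the events, so that
listeners such as a text index or a replication log see every change:

```python
class NotifyingStore:
    """Toy quad store that reports every effective change to its listeners."""

    def __init__(self):
        self._quads = set()
        self._listeners = []

    def register(self, listener):
        # listener is any callable taking (kind, quad), kind in {"added", "removed"}
        self._listeners.append(listener)

    def add(self, quad):
        if quad not in self._quads:        # notify only on real changes
            self._quads.add(quad)
            self._notify("added", quad)

    def remove(self, quad):
        if quad in self._quads:
            self._quads.remove(quad)
            self._notify("removed", quad)

    def bulk_load(self, quads):
        # A second update path; it reuses add() so no change goes unreported.
        for quad in quads:
            self.add(quad)

    def _notify(self, kind, quad):
        for listener in self._listeners:
            listener(kind, quad)


# Usage: one listener keeps a change log that could feed replication.
store = NotifyingStore()
change_log = []
store.register(lambda kind, quad: change_log.append((kind, quad)))
store.bulk_load([("g", "s1", "p", "o"), ("g", "s2", "p", "o")])
store.add(("g", "s1", "p", "o"))           # duplicate: no event
store.remove(("g", "s2", "p", "o"))
```

Calling listeners synchronously for every quad, as this sketch does, is the
simplest design and also the one most likely to hurt performance on large
loads; batching events would be one alternative.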

In conclusion, Frank, share with us your concerns, needs and, if you have any,
feature requests. I hope it is now clearer how I see things in relation to
LARQ. Be careful: these are just my opinions. In Apache, as in many open source
projects, if you want/need something you need to be prepared to take action
and commit time/effort, etc. to it.
Apache Jena and Apache Lucene are two brilliant and useful pieces of software:
LARQ joins them together in a simple and useful way. I like it.

Paolo

PS:
There is another thing, in relation to LARQ and/or free text, that I've been
thinking of... it relates to query optimisation and deciding whether to query
the free text index first or not. In practice, Lucene, Solr and ElasticSearch
are amazingly fast and continue to improve: hitting them first is my advice. :-)
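
A small sketch of what "text index first" means for evaluation order, in plain
Python with hypothetical helper names. It assumes the free text index has
already returned a (usually short) list of matching literals, and then uses
those hits to drive lookups into the triples instead of scanning every
resource and filtering afterwards:

```python
from collections import defaultdict


def build_object_index(triples):
    """Map (predicate, object) -> subjects, a toy stand-in for a store index."""
    by_object = defaultdict(set)
    for s, p, o in triples:
        by_object[(p, o)].add(s)
    return by_object


def text_first(by_object, text_hits, label_pred, type_pred, wanted_type):
    """Start from the text hits, then join against the rest of the pattern."""
    typed = by_object.get((type_pred, wanted_type), set())
    results = set()
    for literal in text_hits:
        for subject in by_object.get((label_pred, literal), set()):
            if subject in typed:
                results.add(subject)
    return sorted(results)


triples = [
    ("ex:jena", "rdfs:label", "Apache Jena"),
    ("ex:jena", "rdf:type", "ex:Project"),
    ("ex:lucene", "rdfs:label", "Apache Lucene"),
    ("ex:lucene", "rdf:type", "ex:Project"),
    ("ex:paolo", "rdfs:label", "Jena committer"),
    ("ex:paolo", "rdf:type", "ex:Person"),
]
hits_for_jena = ["Apache Jena", "Jena committer"]  # what a text search might return
projects = text_first(build_object_index(triples), hits_for_jena,
                      "rdfs:label", "rdf:type", "ex:Project")
```

The work done after the text search is proportional to the number of hits, not
to the size of the dataset, which is why a selective free text clause is a
good place to start.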
