Lucene/Solr and Jena

Frank Tanz Fri, 18 Feb 2011 07:02:05 -0800

Hi Paolo,

Our team has composed more details about the intentions surrounding this 
initiative. As requested, I am submitting this through the Jena-dev List.


Regards,
Frank


2/17/2011



Dear Paolo:


Thank you for your thoughtful response to our proposed project. First and 
foremost, we want to acknowledge the complexity of this endeavor. As graduate 
students, we are very enthusiastic about the opportunity to participate in an 
actual Open Source project. In keeping with the Open Source spirit and 
philosophy, we really appreciate your guidance pointing us to SIREN, LARQ, 
SARQ, and EARQ.

We have done some high-level research into these four projects. While they 
share some underlying design aspects with parts of what our interface must do, 
none of them is striving to meet our vision. We thought SIREN was very 
interesting, as it provides plugins that give Lucene the capability to index 
and query RDF/XML graphs natively. However, this solution requires software 
components in addition to Lucene and currently works outside of Jena with an 
existing persisted RDF/XML graph. LARQ and SARQ were also interesting to 
research, as they make it possible to build Lucene or Solr indexes within Jena 
and to query those indexes from ARQ. Unfortunately, the RDF/XML model must 
exist separately from the index. Lastly, we reviewed EARQ. This project seems 
to provide a layer of abstraction that would allow the developer to plug and 
play an available index mechanism such as Lucene or Solr. Unfortunately, it 
would be feasible to build upon this project only once our interface has been 
achieved. While these four Open Source projects do not provide “plug and play” 
capabilities for our immediate purpose, they do provide good technical 
guidance for the design discussions and decisions that we must make in the 
weeks to come.

We have included our Business Case for moving forward with this project. In 
addition to the Business Case, please understand that this project is more than 
just an academic exercise. Our academic advisor, Scott Streit, serves as CTO 
of a commercial corporation and has many clients utilizing Semantic Web 
applications. These clients include NITRD (the Networking and Information 
Technology Research and Development program) and the U.S. Military. Upon 
successful completion of this interface, next steps would include 
transitioning the storage mechanisms that these clients currently use for 
RDF/XML graphs to Lucene/Solr structures.

Please review and approve this initiative so we can begin our design activities.

Sincerely,

The SolrStore Project Team:

Frank Tanz, Bharti Gupta, Bala Krishna Chitneni, Nimesh Shah

SolrStore Business Case:
It is our vision to add to Jena the capability to persist RDF/XML graphs by 
creating the data store directly within a Lucene inverted index structure. 
Simply put, our approach is to do this without the need for additional 
software components and without using an additional RDBMS. While Jena’s 
existing ability to persist RDF/XML graphs to an RDBMS is a convenient storage 
choice, we argue that an RDBMS is not really appropriate for the Semantic Web, 
as transaction processing and normalized schemas do not fit the dynamic nature 
of the Semantic Web domain. Building upon this argument, the dynamic nature of 
the Semantic Web is better suited to versioning than to heavy-duty transaction 
processing. It is our intent to exploit and leverage this inherent capability 
within Lucene and ultimately present it to Jena developers in an abstract way 
within the Jena API. Additionally, we believe that the Lucene/Solr indexing 
engine is underutilized in that it serves primarily as an index with pointers 
back to the original data source. We intend to use Lucene/Solr not only as an 
indexing engine, but also as the repository for the data itself.

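
To make the idea of keeping triples directly in an inverted index more 
concrete, here is a minimal sketch in plain Python. It is an illustration 
under stated assumptions, not the SolrStore design and not the Lucene or Jena 
API: the class name TripleIndex, the one-document-per-triple layout, and the 
field names "s", "p" and "o" are all hypothetical.

```python
from collections import defaultdict


class TripleIndex:
    """Toy inverted index that is also the only store for the triples.

    Hypothetical layout: one "document" per triple, with fields s, p, o.
    """

    def __init__(self):
        self.docs = []                     # stored triples; doc id = position
        self.postings = defaultdict(set)   # (field, value) -> set of doc ids

    def add(self, s, p, o):
        """Store a triple and index each of its three fields."""
        doc_id = len(self.docs)
        self.docs.append((s, p, o))
        for field, value in zip("spo", (s, p, o)):
            self.postings[(field, value)].add(doc_id)
        return doc_id

    def match(self, s=None, p=None, o=None):
        """Return stored triples matching the pattern; None is a wildcard."""
        bound = [(f, v) for f, v in zip("spo", (s, p, o)) if v is not None]
        if not bound:
            return list(self.docs)
        # Intersect the posting lists of every bound field.
        ids = set.intersection(*(self.postings.get(key, set()) for key in bound))
        return [self.docs[i] for i in sorted(ids)]


store = TripleIndex()
store.add("ex:jena", "rdf:type", "ex:Framework")
store.add("ex:lucene", "rdf:type", "ex:Library")
store.add("ex:solr", "ex:builtOn", "ex:lucene")

typed = store.match(p="rdf:type")
libraries = store.match(p="rdf:type", o="ex:Library")
```

The point of the sketch is only that a triple pattern can be answered from 
posting lists without a separate data source behind the index. Deletes, 
versioning and SPARQL evaluation are deliberately left out.
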
The prime directive for our project is to provide layers of abstraction between 
the Jena API and the Lucene/Solr APIs. This commitment is extremely important 
to us, as the complexities of our interface should not overburden a Jena 
developer who might have limited experience with the components within Lucene 
and Solr. We acknowledge that the use of an RDBMS to persist RDF/XML graphs 
within the Jena API was an innovative design choice for the timeframe of its 
creation. Our team’s objective is to evolve that innovation by building upon it 
with new technologies that are now available and accessible.

While this initiative is not a trivial task, we believe that the objective is 
important and if successful can benefit the Jena community.

---------- Forwarded message ----------
From: Paolo Castagna <[email protected]>
Date: Wed, Feb 9, 2011 at 10:07 AM
Subject: Re: Fwd: Lucene/Solr and Jena
To: [email protected]
Cc: Scott Streit <[email protected]>


Hi Scott (hi all),
First of all, thank you for your email, and nice to "meet" you, even if
only via email and even though we have never had the chance to interact
before. (We clearly have common contacts, though!)

We (@Talis) use Lucene as well as Solr (and possibly something else in the
future) to provide our free text search capabilities. However, we do not
actually store RDF in Lucene indexes. For that, we use a "proper" RDF
store with SPARQL support, which you would otherwise need to implement on
top of Lucene (and that is not a trivial task).

I am very interested in the topic of free text search in the context
of RDF and how free text searches can be 'integrated' with SPARQL.

I'd like to know more about your project plans and, indeed, your motivations.

I am not completely sure if your attachment made it to the jena-dev mailing
list. I have received the attachment anyway, since you added my work-related
email (which I tend to try to protect from evil spammers) to the To: field.
I am subscribed to the [email protected] mailing list, so we can discuss here.

Coming back to the idea of "placing Lucene and Solr into Jena as persistent
store", can I suggest you take a look at SIREn [1]? There is a good chapter
(a case study) in the "Lucene in Action, Second Edition" book [2]. I really
recommend the book, it's a good one.
SIREn's aim is to use Lucene indexes to provide a complete storage system
for RDF; however, I cannot comment on its support for RDF store APIs or on
its level of compliance with SPARQL queries, for example.

A different approach is the one taken by LARQ [3] (and/or similar):

 "LARQ is a combination of ARQ and Lucene. It gives ARQ the ability to
 perform free text searches. Lucene indexes are additional information
 for accessing the RDF graph, not storage for the graph itself."
 -- http://openjena.org/ARQ/lucene-arq.html

LARQ is, at the moment, included in ARQ, but we have an open JIRA issue
(i.e. JENA-9 [4]) to separate it out as a separate module depending on ARQ.
A development version of LARQ as a separate module, ready to be tested,
is available here: https://jena.svn.sourceforge.net/svnroot/jena/LARQ/trunk/
If you or some of your students have time to try it, let me know if you
have problems with it.

As an experiment, I did a similar thing with Solr. It's called SARQ
and it's available here: https://github.com/castagna/SARQ.
It is labeled "experimental (and unsupported)" since I did it out-of-band as
a proof of concept, but, because the design and functionality are the
same as LARQ's, it should not require a lot of effort to make it ready for
production, if others think this might be useful.

While I was writing SARQ, I thought: "wouldn't it be nice to make it
extremely easy for developers to plug in different indexing systems
such as Lucene, Solr or Elastic Search?". So, I had a go at it with EARQ.
It's available here: https://github.com/castagna/EARQ.
Again, it's labeled "experimental (and unsupported)", but if needed
and people are interested in it, it might require only small
improvements.

One of the biggest problems I had in relation to LARQ, SARQ and EARQ is
how to manage "deletes/removals". I've used a Jena Model as the source for
a poor man's reference counting to decide when to remove a document
from the Lucene index. The source code should be clear on this.
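
The reference-counting idea can be sketched roughly as follows. This is an
illustration only, written in plain Python rather than taken from the LARQ,
SARQ or EARQ source: the class RefCountedIndex is hypothetical, a set of
tuples stands in for the Jena Model, and a set of literals stands in for the
documents in the Lucene index.

```python
class RefCountedIndex:
    """Toy model of "remove the document only when nothing references it"."""

    def __init__(self):
        self.model = set()       # stand-in for the Jena Model: (s, p, literal)
        self.documents = set()   # stand-in for Lucene documents, one per literal

    def _references(self, literal):
        # Count the statements in the model that still use this literal.
        return sum(1 for (_, _, o) in self.model if o == literal)

    def add(self, s, p, literal):
        self.model.add((s, p, literal))
        self.documents.add(literal)   # adding to a set twice is a no-op

    def remove(self, s, p, literal):
        self.model.discard((s, p, literal))
        # Drop the indexed document only when no statement references it.
        if self._references(literal) == 0:
            self.documents.discard(literal)


index = RefCountedIndex()
index.add("ex:book1", "dc:title", "Lucene in Action")
index.add("ex:book2", "dc:title", "Lucene in Action")

index.remove("ex:book1", "dc:title", "Lucene in Action")
still_indexed = "Lucene in Action" in index.documents

index.remove("ex:book2", "dc:title", "Lucene in Action")
finally_removed = "Lucene in Action" not in index.documents
```

Counting by scanning the model is slow but keeps the model as the single
source of truth, which is the "poor man's" part of the approach.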

Last but not least, in relation to part of the content of your attachment,
Jena is still in its incubating phase at Apache, but things work almost
the same as for the Apache Software Foundation. Please, have a look at
"How the ASF works" [5].

Let's keep the discussion flowing and invite your students to interact
with us on the jena-dev.

Let me know your motivations for wanting to store RDF in a Lucene/Solr
index.

Regarding the "cloud" references in your project proposal, we should
probably discuss them in a separate thread/message, again on jena-dev.

Paolo


 [1] http://siren.sindice.com/
 [2] http://www.manning.com/hatcher3/
 [3] http://openjena.org/ARQ/lucene-arq.html
 [4] https://issues.apache.org/jira/browse/JENA-9
 [5] http://www.apache.org/foundation/how-it-works.html

Damian Steer wrote:
(I didn't get a moderation message about this, but Paolo was Cc'd and forwarded 
it to me. Is moderation working for anyone?)

Begin forwarded message:

---------- Forwarded message ----------
From: Scott Streit <[email protected]>
Date: Wed, Feb 9, 2011 at 12:44 PM
Subject: Lucene/Solr and Jena
To: [email protected], [email protected]


Jena-dev,

A group of my students at Villanova would like their Master's Degree
project to include placing Lucene and Solr into Jena as a persistent
store.  We are adding two more students.

Attached is an overall project plan.  Upon your approval, the next
step is a design document.

Scott Streit
