Hi Paolo,

Our team has written up more details about the intent behind this initiative. As requested, I am submitting this through the jena-dev list.

Regards,
Frank

2/17/2011

Dear Paolo:

Thank you for your thoughtful response to our proposed project. First and foremost, we want to acknowledge the complexity of this endeavor. As graduate students, we are very enthusiastic about the opportunity to participate in an actual Open Source project. In keeping with the Open Source spirit and philosophy, we really appreciate your guidance pointing us to SIREn, LARQ, SARQ, and EARQ.

We have done some high-level research into these four projects and have determined that, while they share some interesting similarities with the underlying design of parts of what our interface must do, none of them is striving to meet our vision.

We thought SIREn was very interesting, as it provides plugins that give Lucene the capability to index and query RDF/XML graphs natively. However, this solution requires software parts in addition to Lucene and currently works outside of Jena with an existing persisted RDF/XML graph.

LARQ and SARQ were also very interesting to research, as they offer the ability to build Lucene indexes within Jena as well as the ability to query the Lucene index from ARQ. Unfortunately, the RDF/XML model must exist separately from the Lucene index.

Lastly, we reviewed EARQ. This project seems to provide a layer of abstraction that would allow the developer to plug and play an available index mechanism such as Lucene or Solr. Unfortunately, this project would be feasible to build upon only if and when our interface could be achieved.

While these four Open Source projects do not provide “plug and play” capabilities for our immediate purpose, they do provide some really good technical guidance for the design discussions and decisions that we must make in the weeks to come.

We have included our Business Case for moving forward with this project. In addition to the Business Case, please understand that this project is more than just an academic exercise.
Our academic advisor, Scott Streit, serves as CTO for a commercial corporation and has many clients utilizing Semantic Web applications. These clients include NITRD (the Networking and Information Technology Research and Development program) and the U.S. Military. Upon successful completion of this interface, next steps would include transitioning the storage mechanisms that these clients currently use for RDF/XML graphs to Lucene/Solr structures.

Please review and approve this initiative so we can begin our design activities.

Sincerely,
The SolrStore Project Team: Frank Tanz, Bharti Gupta, Bala Krishna Chitneni, Nimesh Shah

SolrStore Business Case:

It is our vision to add to Jena the capability to persist RDF/XML graphs by creating the data store directly within a Lucene inverted index structure. Simply put, our approach is to do this without the need for additional software parts and without using an additional RDBMS.

While Jena’s existing ability to persist RDF/XML graphs to an RDBMS is a convenient storage choice, we argue that an RDBMS is not really appropriate for the Semantic Web, as transaction processing and normalized schemas are not part of the dynamic nature of the Semantic Web domain. Building upon this argument, we believe the dynamic nature of the Semantic Web is better served by versioning than by heavy-duty transaction processing. It is our intent to exploit and leverage this inherent capability within Lucene and ultimately present it to Jena developers in an abstract way within the Jena API.

Additionally, we believe that the Lucene/Solr indexing engine is underutilized in that it serves primarily as an index with pointers back to the original data source. We intend to use Lucene/Solr not only as an indexing engine, but also as the repository for the data source.

The prime directive for our project is to provide layers of abstraction between the Jena API and the Lucene/Solr APIs.
This commitment is extremely important to us, as the complexities of our interface should not overburden a Jena developer who might have limited experience with the components within Lucene and Solr.

We acknowledge that the use of an RDBMS to persist RDF/XML graphs within the Jena API was an innovative design choice for the timeframe of its creation. Our team’s objective is to evolve that innovation by building upon it with new technologies that are now available and accessible. While this initiative is not a trivial task, we believe that the objective is important and, if successful, can benefit the Jena community.

---------- Forwarded message ----------
From: Paolo Castagna <[email protected]>
Date: Wed, Feb 9, 2011 at 10:07 AM
Subject: Re: Fwd: Lucene/Solr and Jena
To: [email protected]
Cc: Scott Streit <[email protected]>

Hi Scott (hi all),

first of all, thank you for your email and nice to "meet" you, even if only via email and even though we have never had the chance to interact before. (We clearly have common contacts though!)

We (@Talis) use Lucene as well as Solr (as well as something else in the future) to provide our free text search capabilities. However, we do not actually store RDF in Lucene indexes. For that, we use a "proper" RDF store with SPARQL support, which you would otherwise need to implement on top of Lucene (and it's not a trivial task).

I am very interested in the topic of free text search in the context of RDF and how free text searches can be 'integrated' with SPARQL. I'd like to know more about your project plans and, indeed, your motivations.

I am not completely sure if your attachment made it to the jena-dev mailing list. I have received the attachment anyway, since you added my work-related email (which I tend to try to protect from evil spammers) to the To: field.
I am subscribed to the [email protected] mailing list, so we can discuss here.

Coming back to the idea of "placing Lucene and Solr into Jena as persistent store", can I suggest you take a look at SIREn [1]? There is a good chapter (a case study) on it in the "Lucene in Action, Second Edition" book [2]. I really recommend the book, it's a good one. SIREn's aim is to use Lucene indexes to provide a complete storage system for RDF; however, I cannot possibly comment on the support for RDF store APIs or their level of compliance in relation to SPARQL queries, for example.

A different approach is the one taken by LARQ [3] (and/or similar):

"LARQ is a combination of ARQ and Lucene. It gives ARQ the ability to perform free text searches. Lucene indexes are additional information for accessing the RDF graph, not storage for the graph itself."
-- http://openjena.org/ARQ/lucene-arq.html

LARQ is, at the moment, included in ARQ, but we have an open JIRA issue (i.e. JENA-9 [4]) to separate it out as a separate module depending on ARQ. A development version of LARQ as a separate module, ready to be tested, is available here: https://jena.svn.sourceforge.net/svnroot/jena/LARQ/trunk/
If you or some of your students have time to try it, let me know if you have problems with it.

As an experiment, I did a similar thing with Solr. It's called SARQ and it's available here: https://github.com/castagna/SARQ. It is labeled "experimental (and unsupported)" since I did it out-of-band as a proof of concept but, because the design and functionalities are the same as LARQ, it should not require a lot of effort to make it ready for production, if others think this might be useful.

While I was writing SARQ, I thought: "wouldn't it be nice to make it extremely easy for developers to plug in different indexing systems such as Lucene, Solr or Elastic Search?". So I gave it a go with EARQ. It's available here: https://github.com/castagna/EARQ.
Again, it's labeled "experimental (and unsupported)", but if needed and people are interested in it, it might require only minor improvements.

One of the biggest problems I had in relation to LARQ, SARQ and EARQ is how to manage "deletes/removals". I've used a Jena Model as the source for a poor man's reference counting to decide when to remove a document from the Lucene index. The source code should be clear on this.

Last but not least, in relation to part of the content of your attachment, Jena is still in its incubating phase at Apache, but things work almost the same as for the Apache Software Foundation. Please have a look at "How the ASF works" [5].

Let's keep the discussion flowing, and do invite your students to interact with us on jena-dev. Let me know your motivations for wanting to store RDF in a Lucene/Solr index.

Regarding the "cloud" references in your project proposal, we should probably discuss them in a separate thread/message, again on jena-dev.

Paolo

[1] http://siren.sindice.com/
[2] http://www.manning.com/hatcher3/
[3] http://openjena.org/ARQ/lucene-arq.html
[4] https://issues.apache.org/jira/browse/JENA-9
[5] http://www.apache.org/foundation/how-it-works.html

Damian Steer wrote:

(I didn't get a moderation message about this, but Paolo was Cc'd and forwarded to me. Is moderation working for anyone?)

Begin forwarded message:

---------- Forwarded message ----------
From: Scott Streit <[email protected]>
Date: Wed, Feb 9, 2011 at 12:44 PM
Subject: Lucene/Solr and Jena
To: [email protected], [email protected]

Jena-dev,

A group of my students at Villanova would like their Master's Degree project to include placing Lucene and Solr into Jena as a persistent store. We are adding two more students. Attached is an overall project plan. Upon your approval, the next step is a design document.

Scott Streit
