Re: Lucene/Solr and Jena

Paolo Castagna Fri, 18 Feb 2011 07:46:02 -0800

Hi Frank,
nice to meet you and I am glad you wrote your reply to the jena-dev mailing 
list.
My comments are inline.


Frank Tanz wrote:
> Hi Paolo,
>
> Our team has composed more details about the intentions surrounding this initiative. As requested, I am submitting this through the Jena-dev List.
>
> Regards,
> Frank
>
>
> 2/17/2011
>
>
>
> Dear Paolo:
>
>

> Thank you for your thoughtful response to our proposed project. First and foremost we want to acknowledge the complexity of this endeavor. As graduate students, we are very enthusiastic about the opportunity to participate in an actual Open Source project.


As I have already written, you should really read:
http://www.apache.org/foundation/how-it-works.html
I'll continue to point you at it every time I suspect there is a
misunderstanding about it. :-) If my suspicion is wrong, all the better,
and I apologize.

> Keeping with the Open Source spirit and philosophy, we really appreciate your guidance pointing us to SIREN, LARQ, SARQ, and EARQ.

> We have done some high level research into these four projects and have determined that while they present some interesting similarities into some of the underlying design aspects for parts of what our interface must do, none of these endeavors are striving to meet our vision. We thought SIREN was very interesting as it provides plugins to Lucene which provide the capability for Lucene to index and query RDF/XML graphs natively. However, this solution requires additional software parts in addition to Lucene and currently works outside of Jena with an existing persisted RDF/XML graph. LARQ and SARQ were also very attention-grabbing to research as the ability exists to build Lucene indexes within Jena as well as the ability to query the Lucene index from ARQ. Unfortunately the RDF/XML model must exist separate from the Lucene index. Lastly, we reviewed EARQ. This project seems to provide a layer of abstraction that would allow the developer to plug and play an available index mechanism such as Lucene or Solr. Unfortunately, this project would be feasible to build upon only if and when our interface could be achieved. While these four Open Source projects do not provide “plug and play” capabilities for our immediate purpose, they do provide some really good technical guidance for design discussions and decisions that we must make in the weeks to come.


There isn't a fundamental difference between LARQ, SARQ and EARQ. They
all provide similar functionality; SARQ and EARQ are just proofs of
concept of a possible evolutionary path for LARQ within the Jena project.

Your comment confirms my impression that you actually want to store
RDF data in Lucene and implement a Jena graph over it. This is not what
LARQ (or SARQ or EARQ) does. They assume you use Lucene only for free
text searches, therefore they index only literals.
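
To make "index only literals" concrete, here is a minimal sketch in plain
Java. It is an illustration only: it uses neither the LARQ nor the Lucene
API, and every class and method name in it is hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index: NOT the LARQ or Lucene API.
final class LiteralIndexSketch {

    // A triple whose object is either a URI or a plain literal.
    static final class Triple {
        final String subject, predicate, object;
        final boolean objectIsLiteral;

        Triple(String subject, String predicate, String object, boolean objectIsLiteral) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
            this.objectIsLiteral = objectIsLiteral;
        }
    }

    // token -> subjects that have a literal value containing that token
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Only literal objects are tokenized and indexed. Triples with URI
    // objects are skipped, so the graph itself must live in a separate store.
    void index(Triple t) {
        if (!t.objectIsLiteral) {
            return;
        }
        for (String token : t.object.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(t.subject);
            }
        }
    }

    // Free text lookup: returns subjects that a query over the RDF store
    // could then use as bindings.
    Set<String> search(String token) {
        return postings.getOrDefault(token.toLowerCase(Locale.ROOT), Collections.<String>emptySet());
    }
}
```

The index answers "which resources mention this word?" and nothing else;
triples that link resources to each other never reach it.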

SIREn is probably closer to what you want to do. What are the additional
software parts you refer to?

What makes you think that storing RDF data in Lucene will give you
better performance than a native RDF store such as, say, TDB?

How are you planning to evaluate the performance of your solution?

May I suggest BSBM and TDB as your baseline?
http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/

> We have included our Business Case for moving forward with this project. In addition to the Business Case, please understand that this project is more than just an academic exercise. Our academic advisor, Scott Streit serves as a CTO for a commercial corporation and has many clients utilizing Semantic Web applications. These clients include NITRD (The Networking and Information Technology Research and Development) and the U.S. Military. Upon successful completion of this interface, next steps would include transitioning the storage mechanisms that these clients currently use for RDF/XML graphs to Lucene/Solr structures.

> Please review and approve this initiative so we can begin our design activities.

I can read your message and share my opinions or give technical suggestions
on projects I have used, but it's not my role to review business cases or
approve initiatives of people who want to do something interesting with Jena.

Once again, I point you at:
http://www.apache.org/foundation/how-it-works.html

>
> Sincerely,
>
> The SolrStore Project Team:
>
> Frank Tanz, Bharti Gupta, Bala Krishna Chitneni, Nimesh Shah
>
> SolrStore Business Case:

> It is our vision to add to Jena the capability to persist RDF/XML graphs by creating the data store directly within a Lucene inverted index structure. Simply said, our approach is to do this without the need for additional software parts and without using an additional RDBMS.


TDB is a native RDF storage system for Jena and it does not use an RDBMS.
What makes you think a solution that stores RDF data directly in Lucene
will be faster/better?

Don't get me wrong: I am not sure whether it will or it won't, and I am
curious about it myself. But I have doubts. If I were you, I would try to
quickly prove with a small prototype that it's possible to achieve better
performance.

> While Jena’s existing ability to persist RDF/XML graphs to an RDBMS is a convenient storage choice, we argue that an RDBMS is not really appropriate for the Semantic Web, as transaction processing and normalized schemas are not part of the dynamic nature of the Semantic Web domain. Building upon this argument, the dynamic nature of the Semantic Web is better suited to use versioning in lieu of heavy duty transaction processing. It is our intent to exploit and leverage this inherent capability within Lucene and ultimately present it to Jena developers in an abstract way within the Jena API. Additionally, we believe that the Lucene/Solr indexing engine is underutilized in that it serves primarily as an index with pointers back to the original data source. We intend to not only use Lucene/Solr as an indexing engine, but also as the repository for the data source.
>
> The prime directive for our project is to provide layers of abstraction between the Jena API and the Lucene/Solr API’s. This commitment is extremely important to us as the complexities of our interface should not over burden a Jena developer who might have limited experience with the components within Lucene and Solr. We acknowledge that the use of an RDBMS to persist RDF/XML graphs within the Jena API was an innovative design choice for the timeframe of its creation. Our team’s objective is to evolve that innovation by building upon it with new technologies that are now available and accessible.


If I understand correctly, your motivation/rationale for wanting to
store RDF in a Lucene index is a sort of discontent with solutions
which use an RDBMS.

However, you have not mentioned or looked at the native RDF storage system
which comes with Jena.

  """
  There are two subsystems for persisting RDF and OWL data, SDB or TDB.
  These are separate downloads.

  TDB is a high-performance, native persistence engine using custom
  indexing and storage. SDB is a persistence layer that uses an SQL
  database and supports full ACID transactions.
  TDB is faster and simpler to setup.

    * TDB documentation: http://openjena.org/wiki/TDB
    * SDB documentation: http://openjena.org/wiki/SDB

  The original RDB system is still shipped with Jena for legacy
  applications. It is deprecated for new development.
  """
  -- http://www.openjena.org/documentation.html

So, please, have a look at the TDB documentation, try to install and use it:

 - http://openjena.org/wiki/TDB

> While this initiative is not a trivial task, we believe that the objective is important and if successful can benefit the Jena community.

It's true that what you want to do is not trivial.

However, IMHO, you should be able to prove with a quick prototype of a
Jena Graph SPI implementation that an RDF storage solution over Lucene
indexes is faster than what's already there (in particular TDB). Then you
will probably get some more attention.
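
To give a rough idea of the scale of such a prototype: the core question a
graph implementation has to answer is a triple pattern lookup with
wildcards. Below is a minimal sketch in plain Java, assuming one
inverted-index "document" per triple. The class and method names are
hypothetical; it uses neither Jena's Graph SPI nor Lucene, and it ignores
duplicates, deletes and persistence.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a graph store: not Jena's Graph SPI, not Lucene.
final class TermIndexedGraphSketch {

    static final class Triple {
        final String s, p, o;

        Triple(String s, String p, String o) {
            this.s = s;
            this.p = p;
            this.o = o;
        }
    }

    // One "document" per triple; the list position is the document id.
    private final List<Triple> docs = new ArrayList<>();
    // Inverted index: "field=value" term -> ids of the documents containing it.
    private final Map<String, BitSet> postings = new HashMap<>();

    void add(String s, String p, String o) {
        int id = docs.size();
        docs.add(new Triple(s, p, o));
        post("s=" + s, id);
        post("p=" + p, id);
        post("o=" + o, id);
    }

    private void post(String term, int id) {
        postings.computeIfAbsent(term, k -> new BitSet()).set(id);
    }

    // Triple pattern lookup; a null argument matches anything.
    List<Triple> find(String s, String p, String o) {
        BitSet hits = new BitSet();
        hits.set(0, docs.size());          // start with every document
        restrict(hits, "s=", s);
        restrict(hits, "p=", p);
        restrict(hits, "o=", o);
        List<Triple> out = new ArrayList<>();
        for (int i = hits.nextSetBit(0); i >= 0; i = hits.nextSetBit(i + 1)) {
            out.add(docs.get(i));
        }
        return out;
    }

    // Intersect the current hits with the posting list of one bound term.
    private void restrict(BitSet hits, String field, String value) {
        if (value == null) {
            return;                        // wildcard: no restriction
        }
        BitSet list = postings.get(field + value);
        if (list == null) {
            hits.clear();                  // unknown term: nothing matches
        } else {
            hits.and(list);
        }
    }
}
```

Timing `find` on something like this against the same patterns on TDB,
with a realistic dataset, is the kind of evidence I have in mind.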

I don't think we have specific documentation to guide you on how to
put Jena over a different store/indexing system (in this case Lucene)
by implementing the Graph SPI. Do we?

I can point you at these, though:

 - https://jena.svn.sourceforge.net/svnroot/jena/TDB/trunk/src/main/java/com/hp/hpl/jena/tdb/store/
   ... see GraphTDB and GraphTDBBase
 - https://github.com/castagna/hbase-rdf
   ... this copies the TDB approach; I'd like to see how things could
   work over HBase (it's not finished/working yet).
 - http://openjena.org/ARQ/arq-query-eval.html

HTH,
Paolo

>
> ---------- Forwarded message ----------
> From: Paolo Castagna <[email protected]>
> Date: Wed, Feb 9, 2011 at 10:07 AM
> Subject: Re: Fwd: Lucene/Solr and Jena
> To: [email protected]
> Cc: Scott Streit <[email protected]>
>
>
> Hi Scott (hi all),
> first of all, thank you for your email and nice to "meet" you. Even if
> only via email, and even if we have never had the chance to interact
> before. (We clearly have common contacts though!).
>
> We (@Talis) use Lucene as well as Solr (as well as something else in the
> future) to provide our free text search capabilities. However we do not
> actually store RDF into Lucene indexes. For that, we use a "proper" RDF
> store with SPARQL support which otherwise you will need to implement on
> top of Lucene (and it's not a trivial task).
>
> I am very interested in the topic of free text search in the context
> of RDF and how free text searches can be 'integrated' with SPARQL.
>
> I'd like to know more about your project plans and, indeed, your motivations.
>
> I am not completely sure if your attachment made it to the jena-dev mailing
> list. I have received the attachment anyway, since you added my work related
> email (which I tend to try to protect from evil spammers) to the To: field.
> I am subscribed to the [email protected] mailing list, so we
> can discuss here.
>
> Coming back to the idea of "placing Lucene and Solr into Jena as persistent
> store", can I suggest you take a look at SIREn [1]? There is a good chapter
> (a case study) on the "Lucene in Action, Second Edition" book [2]. I really
> recommend the book, it's a good one.
> SIREn's aim is to use Lucene indexes to provide a complete storage system
> for RDF, however I cannot possibly comment on the support for RDF store
> APIs or their level of compliance in relation to SPARQL queries, for example.
>
> A different approach is the one taken by LARQ [3] (and/or similar):
>
>  "LARQ is a combination of ARQ and Lucene. It gives ARQ the ability to
>  perform free text searches. Lucene indexes are additional information
>  for accessing the RDF graph, not storage for the graph itself."
>  -- http://openjena.org/ARQ/lucene-arq.html
>
> LARQ is, at the moment, included in ARQ, but we have an open JIRA issue
> (i.e. JENA-9 [4]) to separate it out as a separate module depending on ARQ.
> A development version of LARQ as a separate module, ready to be tested,
> is available here: https://jena.svn.sourceforge.net/svnroot/jena/LARQ/trunk/
> If you or some of your students have time to try it, let me know if you
> have problems with it.
>
> As an experiment, I did a similar thing with Solr, it's called SARQ
> and it's available here: https://github.com/castagna/SARQ.
> It is labeled "experimental (and unsupported)" since I did it out-of-band
> as a proof of concept but, because the design and functionality are the
> same as LARQ, it should not require a lot of effort to make it ready for
> production, if others think this might be useful.
>
> While I was writing SARQ, I thought: "wouldn't it be nice to make it
> extremely easy for developers to plug in different indexing systems
> such as Lucene, Solr or Elastic Search?". So, I had a go at EARQ.
> It's available here: https://github.com/castagna/EARQ.
> Again, it's labeled "experimental (and unsupported)", but if needed
> and people are interested in it, it might require only small
> improvements.
>
> One of the biggest problems I had in relation to LARQ, SARQ and EARQ is
> how to manage "deletes/removals". I've used a Jena Model as source for
> a poor man's reference counting to decide when to remove a document
> from the Lucene index. The source code should be clear on this.
>
> Last but not least, in relation to part of the content of your attachment,
> Jena is still in its incubating phase at Apache, but things work almost
> the same as for the Apache Software Foundation. Please, have a look at
> "How the ASF works" [5].
>
> Let's keep the discussion flowing and invite your students to interact
> with us on the jena-dev.
>
> Let me know your motivations for wanting to store RDF in a Lucene/Solr
> index.
>
> Regarding the "cloud" references in your project proposal, we should
> probably discuss it on a separate thread/message, always, on jena-dev.
>
> Paolo
>
>
>  [1] http://siren.sindice.com/
>  [2] http://www.manning.com/hatcher3/
>  [3] http://openjena.org/ARQ/lucene-arq.html
>  [4] https://issues.apache.org/jira/browse/JENA-9
>  [5] http://www.apache.org/foundation/how-it-works.html
>
> Damian Steer wrote:
> (I didn't get a moderation message about this, but Paolo was Ccd and forwarded to me. Is moderation working for anyone?)
>
> Begin forwarded message:
>
> ---------- Forwarded message ----------
> From: Scott Streit <[email protected]>
> Date: Wed, Feb 9, 2011 at 12:44 PM
> Subject: Lucene/Solr and Jena
> To: [email protected], [email protected]
>
>
> Jena-dev,
>
> A group of my students at Villanova would like their Master's Degree
> project to include placing lucene and solr into Jena as a persistent
> store.  We are adding two more students.
>
> Attached is an overall project plan.  Upon your approval, the next
> step is a design document.
>
> Scott Streit
>
>
