Hi Everyone,

I'm working on a proposal for the “Apache Solr backend for Apache Sling” task as
part of Google Summer of Code 2013 –
https://issues.apache.org/jira/browse/SLING-2795. So far I have been reading
articles, watching videos and looking through source code to investigate the
topic in more depth. Now I want to describe my vision of the task and the
implementation approach. Your comments and suggestions would be very helpful
for improving the proposal and getting more value out of implementing the task.

I see several parts of the task.

*1. Provide CRUDL operations for Solr data through the Sling API.*

This will allow creating Sling resources that reside in a Solr server and
querying them through the Sling API using Solr's search capabilities. Solr
query syntax would be used for the queries.

From the Sling API perspective, a custom *ResourceProvider* (and *Resource*)
implementation will be created, additionally implementing
*QueriableResourceProvider* and *ModifyingResourceProvider*. (If necessary, the
*RefreshableResourceProvider* and *DynamicResourceProvider* interfaces will
also be implemented.) To communicate with the Solr server, the SolrJ API will
be used.
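
To make this more concrete, here is a rough sketch of how such a provider could
look. The class name SolrResourceProvider, the mount path /solr, the
configuration property solr.url and the SolrResource wrapper class are just
assumptions for illustration, not a final design:

import java.util.Iterator;
import java.util.Map;

import javax.servlet.http.HttpServletRequest;

import org.apache.felix.scr.annotations.Activate;
import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Properties;
import org.apache.felix.scr.annotations.Property;
import org.apache.felix.scr.annotations.Service;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceProvider;
import org.apache.sling.api.resource.ResourceResolver;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocumentList;

/**
 * Sketch of a Solr-backed resource provider mounted at /solr.
 * QueriableResourceProvider and ModifyingResourceProvider would be added
 * to the "implements" list once the basic lookup works.
 */
@Component(metatype = true)
@Service(value = ResourceProvider.class)
@Properties({
    @Property(name = ResourceProvider.ROOTS, value = "/solr")
})
public class SolrResourceProvider implements ResourceProvider {

    // Hypothetical configuration property pointing to the Solr core.
    @Property(value = "http://localhost:8983/solr/collection1")
    private static final String PROP_SOLR_URL = "solr.url";

    private SolrServer solr;

    @Activate
    protected void activate(final Map<String, Object> config) {
        solr = new HttpSolrServer((String) config.get(PROP_SOLR_URL));
    }

    @Override
    public Resource getResource(ResourceResolver resolver, HttpServletRequest request, String path) {
        return getResource(resolver, path);
    }

    @Override
    public Resource getResource(ResourceResolver resolver, String path) {
        try {
            // Map the last path segment to the Solr document id (an assumption).
            String id = path.substring(path.lastIndexOf('/') + 1);
            SolrDocumentList docs = solr.query(new SolrQuery("id:" + id)).getResults();
            if (docs.isEmpty()) {
                return null; // not found -> let other providers try
            }
            // SolrResource would be a Resource implementation wrapping the SolrDocument.
            return new SolrResource(resolver, path, docs.get(0));
        } catch (Exception e) {
            return null; // the real implementation would log and handle errors properly
        }
    }

    @Override
    public Iterator<Resource> listChildren(Resource parent) {
        // Solr is flat; children could be derived from a query on a "parent" field.
        return null;
    }
}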

*2. Provide convenient ways to create Solr resources based on different data.*

*2.1. Create a Solr resource based on an arbitrary Sling resource.* This
will allow adding Sling resources to the Solr server for efficient search. The
created Solr resource will also hold a reference (most likely the resource
path) to the original Sling resource. The *Adaptable* concept seems to be a
reasonable way of implementing this functionality – to “convert” an arbitrary
Sling resource to a Solr resource and to resolve the original Sling resource
based on the Solr resource.

Also, I think that not all of a Sling resource's metadata should be used when
creating the corresponding Solr resource – so this task should also include
some configuration to specify which metadata needs to be passed to the Solr
resource. Additionally, some transformations of the resource metadata could be
supported here.
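
As an illustration of the 2.1 idea, here is a minimal sketch of such an
AdapterFactory. The class name, the field names and the property whitelist are
assumptions; the whitelist would really come from configuration:

import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Properties;
import org.apache.felix.scr.annotations.Property;
import org.apache.felix.scr.annotations.Service;
import org.apache.sling.api.adapter.AdapterFactory;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ValueMap;
import org.apache.solr.common.SolrInputDocument;

/**
 * Sketch of an AdapterFactory turning an arbitrary Sling resource into a
 * SolrInputDocument that can be indexed through SolrJ.
 */
@Component
@Service(value = AdapterFactory.class)
@Properties({
    @Property(name = AdapterFactory.ADAPTABLE_CLASSES, value = "org.apache.sling.api.resource.Resource"),
    @Property(name = AdapterFactory.ADAPTER_CLASSES, value = "org.apache.solr.common.SolrInputDocument")
})
public class SolrDocumentAdapterFactory implements AdapterFactory {

    // Hypothetical whitelist of properties worth indexing; in the real
    // implementation this would come from OSGi configuration.
    private static final String[] INDEXED_PROPERTIES = { "jcr:title", "jcr:description" };

    @SuppressWarnings("unchecked")
    @Override
    public <AdapterType> AdapterType getAdapter(Object adaptable, Class<AdapterType> type) {
        if (!(adaptable instanceof Resource) || type != SolrInputDocument.class) {
            return null;
        }
        Resource resource = (Resource) adaptable;
        ValueMap props = resource.adaptTo(ValueMap.class);

        SolrInputDocument doc = new SolrInputDocument();
        // Keep a reference back to the original Sling resource.
        doc.addField("id", resource.getPath());
        doc.addField("sling_resource_type", resource.getResourceType());
        if (props != null) {
            for (String name : INDEXED_PROPERTIES) {
                Object value = props.get(name);
                if (value != null) {
                    doc.addField(name, value);
                }
            }
        }
        return (AdapterType) doc;
    }
}

With something like this registered, resource.adaptTo(SolrInputDocument.class)
would return the document, which could then be sent to the server with SolrJ
(solrServer.add(doc) followed by a commit).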

*2.2.* When creating Solr resources, not all data can be stored in Solr
efficiently – for instance, large binary files. In that situation, one could
create a Sling resource (for instance, a FileSystem or Jackrabbit one) and then
create the Solr resource based on that Sling resource – this would allow both
efficient search through Solr and effective storage options. As an
optimization, these steps could be done automatically based on some
configuration. So *when a Solr resource is created we could analyze it*
(analyze the metadata, try to adapt it to certain types) *and create additional
supporting resources in other parts of the Sling virtual resource tree if
necessary*. What do you think – is it necessary to implement such
functionality, or will option 2.1 be sufficient? What useful scenarios do
you see for this task besides the “large binary” scenario?
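
To make the scenario I have in mind a bit clearer, here is a rough
illustration. The helper class, the size threshold, the /content/solr-binaries
path and the property names are all made up:

import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

import org.apache.sling.api.resource.PersistenceException;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;

/**
 * Illustration of the "analyze and offload" idea from 2.2: before a Solr
 * resource is created, a large binary is stored as a normal Sling resource
 * and the Solr document only keeps the path to it.
 */
public class BinaryOffloadHelper {

    private static final long MAX_INLINE_SIZE = 16 * 1024; // bytes, arbitrary

    public Map<String, Object> prepareForSolr(ResourceResolver resolver,
            Map<String, Object> properties) throws PersistenceException {
        Map<String, Object> solrProps = new HashMap<String, Object>(properties);
        Object data = solrProps.get("jcr:data");
        Object size = solrProps.get("size");
        if (data instanceof InputStream && size instanceof Long && (Long) size > MAX_INLINE_SIZE) {
            // Store the binary as a regular Sling resource...
            Resource parent = resolver.getResource("/content/solr-binaries");
            Resource binary = resolver.create(parent,
                    String.valueOf(System.currentTimeMillis()), properties);
            resolver.commit();
            // ...and keep only a reference in the Solr document.
            solrProps.remove("jcr:data");
            solrProps.put("binaryPath", binary.getPath());
        }
        return solrProps;
    }
}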

*3. Provide a solution to support search for arbitrary Sling resources
through the Sling API using Solr capabilities.*

From my point of view, this one needs some external solution to support
things like full indexing, incremental indexing, different schedules, etc. I
see that the Solr DataImportHandler or Apache ManifoldCF could be utilized for
this task. So the idea here would be to write the necessary implementation so
that the Sling virtual resource tree could be used as a data source for one of
the components mentioned above. What do you think about this approach? Could
you advise some other alternatives to the Solr DataImportHandler and Apache
ManifoldCF for implementing this task?
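
For the DataImportHandler option, I imagine something along the lines of a
custom EntityProcessor. This is only a rough sketch: the class, the slingRoot
attribute and, most importantly, how the processor would actually reach the
Sling resource tree from inside Solr are open questions I would like feedback
on:

import java.util.Iterator;
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.EntityProcessorBase;

/**
 * Rough sketch of a DataImportHandler EntityProcessor that emits one row per
 * Sling resource. How the rows are actually fetched (Sling's JSON rendering
 * over HTTP, or some other bridge) is exactly the part I would like to discuss.
 */
public class SlingEntityProcessor extends EntityProcessorBase {

    private Iterator<Map<String, Object>> rows;

    @Override
    public void init(Context context) {
        super.init(context);
        // "slingRoot" would be configured on the <entity> element in data-config.xml.
        String slingRoot = context.getEntityAttribute("slingRoot");
        rows = fetchResources(slingRoot);
    }

    @Override
    public Map<String, Object> nextRow() {
        // Returning null tells the DataImportHandler that there are no more rows.
        return (rows != null && rows.hasNext()) ? rows.next() : null;
    }

    private Iterator<Map<String, Object>> fetchResources(String slingRoot) {
        // Placeholder: traverse the Sling tree starting at slingRoot and turn each
        // resource into a field map (id = resource path, plus selected properties).
        return null;
    }
}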



Also, I've got a couple of questions on the Sling API:

   - Am I right that the “best practice” way to provide a bundle with a custom
   *ResourceProvider* implementation is to use the Apache Felix Maven SCR Plugin
   and to specify certain SCR annotations (like *@Component*, *@Service* and
   some others) on the corresponding classes – the *ResourceProvider* or
   *ResourceProviderFactory* implementation in this case?


   - I see that *ResourceResolver* is intended to be used by clients to
   obtain and work with Sling resources. It also seems to me that it is
   unlikely to be necessary to create a custom *ResourceResolver*
   implementation for the Solr integration task. Still, could you please
   describe some typical valid cases in which one would need to create a
   custom *ResourceResolver*?


   - Suppose I have configured the same resource provider implementation
   (like the file system resource provider or a possible Solr resource
   provider) under two paths “/url1” and “/url2”. Now I want to perform
   *findResources*/*queryResources*, but only for the resources residing
   under “/url1”. Is it possible to limit the search results in such a way?
   (Probably I missed something, but looking through the source code it
   seems that query results from all queriable resource providers supporting
   the given query language are combined regardless of where in the resource
   tree the corresponding provider is configured.)



Please share any feedback/thoughts you have after reading this vision –
it will really help me work out the details further.



Many thanks in advance,

Ilya
