Hi Everyone,

I’m working on a proposal for the “Apache Solr backend for Apache Sling” task as part of Google Summer of Code 2013: https://issues.apache.org/jira/browse/SLING-2795. So far I have been reading articles, watching videos, and looking through the source code to investigate the topic in more depth. Now I want to describe my vision of the task and the implementation approach. All your comments and suggestions would be very helpful for improving my proposal and getting more value out of implementing the task.
I see several parts of the task.

*1. Provide CRUDL operations for Solr data through Sling API.*

This will allow creating Sling resources that reside in a Solr server and querying them through the Sling API using Solr search capabilities. Solr query syntax should be used for queries. From the Sling API perspective, a custom *ResourceProvider* (and *Resource*) implementation will be created, additionally implementing *QueriableResourceProvider* and *ModifyingResourceProvider*. (If necessary, the *RefreshableResourceProvider* and *DynamicResourceProvider* interfaces will also be implemented.) The SolrJ API will be used to communicate with the Solr server.

*2. Provide convenient ways to create Solr resources based on different data.*

*2.1. Create Solr resource based on arbitrary Sling resource.*

This will allow adding Sling resources to a Solr server for efficient search. The created Solr resource will also hold a reference (most likely, the resource path) to the original Sling resource. The *Adaptable* concept seems to be a reasonable way of implementing this functionality: to “convert” an arbitrary Sling resource to a Solr resource, and to resolve the original Sling resource from a Solr resource. Also, I think that not all metadata of a Sling resource should be used when creating the corresponding Solr resource, so this task should also include some configuration to specify which metadata is passed to the Solr resource. Additionally, some transformations on resource metadata could be supported here.

*2.2.* When creating Solr resources, not all data can be stored efficiently in Solr, for instance large binary files. In that situation, one could create a Sling resource (for instance, FileSystem or Jackrabbit) and then create a Solr resource based on that Sling resource; this’ll allow both efficient search through Solr and effective storage. As an optimization, these steps could be done automatically based on some configuration.
So *when a Solr resource is created, we could analyze it* (analyze its metadata, try to adapt it to certain types) *and create additional supporting resources in other parts of the Sling virtual resource tree if necessary*. What do you think: is it necessary to implement such functionality, or will option 2.1 be sufficient? What useful scenarios do you see for this task besides the “large binary” scenario?

*3. Provide a solution to support search for arbitrary Sling resources through Sling API using Solr capabilities.*

From my point of view, this one needs some external solution to support things like full indexing, incremental indexing, creating different schedules, etc. I see that Solr DataImportHandler or Apache ManifoldCF could be utilized for this task. So the concept of the solution here would be to write the necessary implementation so that the Sling virtual resource tree can be used as a data source for one of the components mentioned above. What do you think about this approach? Could you advise some other alternatives to Solr DataImportHandler and Apache ManifoldCF for implementing this task?

Also, I’ve got a couple of questions on Sling API:

- Am I right that the “best practice” way to provide a bundle with a custom *ResourceProvider* implementation is to use the Apache Felix Maven SCR Plugin and specify certain SCR annotations (like *@Component*, *@Service* and some others) on the corresponding classes, i.e. the *ResourceProvider* or *ResourceProviderFactory* implementation in this case?
- I see that *ResourceResolver* is intended to be used by clients to obtain and work with Sling resources. It also seems to me that a custom *ResourceResolver* implementation is unlikely to be needed for the Solr integration task. But still, could you please describe some typical valid cases in which one would need to create a custom *ResourceResolver*?
- Suppose I have configured the same resource provider implementation (like the file system resource provider or a possible Solr resource provider) under two URLs, “/url1” and “/url2”. Now I want to perform *findResources*/*queryResources*, but only for the resources residing under “/url1”. Is it possible to limit search results in such a way? (Probably I missed something, but looking through the source code it seems that query results from all queriable resource providers supporting the given query language will be combined regardless of where in the resource tree the corresponding provider is configured.)

Please write any feedback or thoughts you have after reading this vision; this’ll really help me to understand the details further.

Many thanks in advance,
Ilya
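
P.S. To make the last question above concrete, here is a minimal sketch of the client-side workaround I have in mind if the combined results cannot be scoped by the API itself: take the paths of all returned resources and keep only those under a given root. It uses plain Java and plain path strings only; the class and method names (PathScopedResults, isUnder, restrictTo) are hypothetical and not part of the Sling API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration only; not part of the Sling API. */
final class PathScopedResults {

    private PathScopedResults() {
    }

    /** Returns true if path equals root or lies below it in the resource tree. */
    static boolean isUnder(String root, String path) {
        if (root.equals("/")) {
            return path.startsWith("/");
        }
        // Appending "/" prevents "/url10/c" from matching the root "/url1".
        return path.equals(root) || path.startsWith(root + "/");
    }

    /** Keeps only the paths located under root, preserving their order. */
    static List<String> restrictTo(String root, List<String> paths) {
        List<String> result = new ArrayList<>();
        for (String path : paths) {
            if (isUnder(root, path)) {
                result.add(path);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed stand-in for the paths of the combined query results.
        List<String> combined = Arrays.asList("/url1/a", "/url2/b", "/url1", "/url10/c");
        System.out.println(restrictTo("/url1", combined)); // prints [/url1/a, /url1]
    }
}
```

Filtering after the fact still makes every provider run the query, so I would prefer a way to scope the query itself if one exists.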