Re: GSoC 2013 - Apache Solr backend for Apache Sling

Ian Boston Tue, 30 Apr 2013 15:05:49 -0700

Hi,
Some comments inline, but please remember to submit this proposal at the
GSoC site so that it can be reviewed.
The deadline is 3rd May 2013, i.e. this Friday.

Ian
(More below).


On 30 April 2013 19:15, Ilya Velesevich <[email protected]> wrote:

> Hi Everyone,
>
> I’m working on a proposal for the “Apache Solr backend for Apache Sling” task as
> part of Google Summer of Code 2013 –
> https://issues.apache.org/jira/browse/SLING-2795. So far I have been reading
> articles, watching videos and looking through source code to investigate the
> topic in more depth. Now I want to describe my vision of the task and the
> implementation approach. All your comments and suggestions would be very
> helpful for improving my proposal and getting more value out of
> implementing the task.
>
> I see several parts of the task.
>
> *1. Provide CRUDL operations for Solr data through the Sling API.*
>
> This will allow creating Sling resources residing in Solr server and
> querying them through Sling API using Solr search capabilities. Solr query
> syntax should be used for queries.
>
> From the Sling API perspective, a custom *ResourceProvider* (and *Resource*)
> implementation will be created, additionally implementing
> *QueriableResourceProvider* and *ModifyingResourceProvider*. (If necessary, the
> *RefreshableResourceProvider* and *DynamicResourceProvider* interfaces will
> also be implemented.) To communicate with the Solr server, the SolrJ API will be
> used.
>


Yes (and you might want to think about running Solr embedded for dev
purposes).


>
> *2. Provide convenient ways to create Solr resources based on
> different data.*
>
> *2.1. Create a Solr resource based on an arbitrary Sling resource*. This
> will allow adding Sling resources to Solr server for efficient search. The
> created Solr resource will also hold a reference (most likely, resource
> path) to the original Sling resource. The *Adaptable* concept seems to be a
> reasonable way of implementing this functionality – to “convert” arbitrary
> Sling resource to Solr resource and resolve original Sling resource based
> on Solr resource.
>
> Also I think that not all metadata of Sling resource should be used when
> creating corresponding Solr resource – so this task should also include
> some configuration to specify metadata necessary to be passed to Solr
> resource. Additionally, some transformations on resource metadata could be
> supported here.
>



I think you should initially focus on just getting or resolving Solr
resources using the ResourceResolver.

Later you can add creating those resources via the
ModifyingResourceProvider. If you think of a Resource as a map of
properties, then it fits the Solr document model reasonably well, i.e. a
Resource maps 1:1 to a Solr Document.
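
As a rough illustration of that 1:1 mapping, here is a minimal sketch in
plain Java. It deliberately uses only java.util maps instead of the real
Sling or SolrJ types, and the class name, the "path" field name, and the
toDocument/toProperties helpers are all hypothetical, chosen just for this
example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: plain maps stand in for a Sling Resource's
// properties and for the fields of a Solr document.
class ResourceDocumentMapping {

    // Assumed name of the document field that holds the resource path.
    static final String PATH_FIELD = "path";

    // Flatten a resource (path + properties) into one document-like map.
    // A property that happens to be called "path" is skipped so that it
    // cannot overwrite the resource path.
    static Map<String, Object> toDocument(String path, Map<String, Object> properties) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put(PATH_FIELD, path);
        for (Map.Entry<String, Object> e : properties.entrySet()) {
            if (!PATH_FIELD.equals(e.getKey())) {
                doc.put(e.getKey(), e.getValue());
            }
        }
        return doc;
    }

    // Recover the property map from a document-like map.
    static Map<String, Object> toProperties(Map<String, Object> doc) {
        Map<String, Object> props = new LinkedHashMap<>(doc);
        props.remove(PATH_FIELD);
        return props;
    }

    public static void main(String[] args) {
        Map<String, Object> props = new LinkedHashMap<>();
        props.put("title", "Hello");
        props.put("tags", Arrays.asList("a", "b"));

        Map<String, Object> doc = toDocument("/solr/docs/hello", props);
        System.out.println(doc);
        System.out.println(toProperties(doc).equals(props));
    }
}
```

In a real provider the document-like map would presumably become a SolrJ
input document, but the shape of the mapping should be much the same.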



>
> *2.2.* When creating Solr resources, not all data can be stored efficiently
> in Solr – for instance, large binary files. In that
> situation, one could create a Sling resource (for instance, FileSystem or
> Jackrabbit) and then create a Solr resource based on that Sling resource –
> this’ll allow both efficient search through Solr and effective storage
> options. As an optimization, these steps could be done automatically based
> on some configuration. So *when a Solr resource is created we could analyze
> it* (analyze metadata, trying to adapt to certain types) *and create
> additional supporting resources in other parts of the Sling virtual resource
> tree if necessary*. What do you think – is it necessary to implement such
> functionality, or will option 2.1 be sufficient? What useful scenarios do
> you see for this task besides the “large binary” scenario?
>


Resources may have properties that are streams. How the stream is stored
and delivered is an implementation detail of the ResourceProvider and the
object it provides. So a SolrResourceProvider might provide SolrResource
objects, which expose a SolrResourceDocument when
resource.adaptTo(SolrResourceDocument.class) is invoked.

The SolrResourceDocument might then have a getBodyStream() method.
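
A minimal sketch of that adaptTo() idea, in plain Java so it compiles on its
own. The nested Adaptable interface is only a local stand-in for the Sling
one, and the constructors, fields, and in-memory body are assumptions made
for the example; where the bytes really come from is up to the provider.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class AdaptToSketch {

    // Local stand-in for Sling's Adaptable interface (assumption).
    interface Adaptable {
        <T> T adaptTo(Class<T> type);
    }

    // Hypothetical document view of a Solr-backed resource.
    static class SolrResourceDocument {
        private final byte[] body;

        SolrResourceDocument(byte[] body) {
            this.body = body;
        }

        // Here the body is held in memory; a real provider could stream
        // it from wherever it chooses to store it.
        InputStream getBodyStream() {
            return new ByteArrayInputStream(body);
        }
    }

    // Hypothetical resource object handed out by a SolrResourceProvider.
    static class SolrResource implements Adaptable {
        private final String path;
        private final byte[] body;

        SolrResource(String path, String body) {
            this.path = path;
            this.body = body.getBytes(StandardCharsets.UTF_8);
        }

        String getPath() {
            return path;
        }

        @Override
        public <T> T adaptTo(Class<T> type) {
            if (type == SolrResourceDocument.class) {
                return type.cast(new SolrResourceDocument(body));
            }
            return null; // adapter type not supported by this resource
        }
    }

    public static void main(String[] args) throws IOException {
        SolrResource resource = new SolrResource("/solr/docs/hello", "hello body");
        SolrResourceDocument doc = resource.adaptTo(SolrResourceDocument.class);
        try (InputStream in = doc.getBodyStream()) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```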


>
> *3. Provide a solution to support search for arbitrary Sling
> resources through the Sling API using Solr capabilities.*
>
> From my point of view this one needs some external solutions to support
> things like full index, incremental index, creating different schedules,
> etc. I see that Solr DataImportHandler or Apache ManifoldCF could be
> utilized for this task. So the concept of solution here would be to write
> necessary implementation so that Sling virtual resource tree could be used
> as a data source for one of the components mentioned above. What do you
> think about this approach? Could you advise some other alternatives to Solr
> DataImportHandler and Apache ManifoldCF for implementing this task?
>
>
>
> Also I’ve got a couple of questions on the Sling API:
>
>    - Am I right that the “best practice” way to provide a bundle with a custom
>    *ResourceProvider* implementation is to use the Apache Felix Maven SCR Plugin
>    and specify certain SCR annotations (like *@Component*, *@Service* and
>    some others) on the corresponding classes – the *ResourceProvider* or
>    *ResourceProviderFactory* implementation in this case?
>

IIRC you will implement a ResourceProviderFactory as a @Component with a
@Service annotation indicating it implements the ResourceProviderFactory
interface. It will then build ResourceProvider objects. To check, I would
need to have a quick look at the API.




>
>
>    - I see that *ResourceResolver* is intended to be used by clients to
>    obtain and work with Sling resources. It also seems to me that it is
>    unlikely to be necessary to create a custom *ResourceResolver* implementation
>    for the Solr integration task. But still, could you please describe some
>    valid typical cases when one would need to create a custom
>    *ResourceResolver*?
>



Correct, you won't need to create a ResourceResolver.


>
>
>    - Suppose I have configured the same resource provider implementation (like
>    the file system resource provider or a possible Solr resource provider) under
>    two URLs, “/url1” and “/url2”. Now I want to perform *findResources*/
>    *queryResources* but only for the resources residing under “/url1”. Is it
>    possible to limit search results in such a way? (Probably I missed something,
>    but looking through the source code it seems that query results from all
>    queriable resource providers supporting the given query language will be
>    combined regardless of where in the resource tree the corresponding provider is
>    configured.)
>


You may decide to limit searches to path subtrees in the query language
itself.
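
For example, if each Solr document carried its resource path in a field,
the subtree restriction could be appended to the query as an extra clause.
The sketch below is only that: it assumes a hypothetical string field named
"path" holding the full resource path, and the escape() helper is a
hand-written approximation of Lucene/Solr query escaping, not a library call.

```java
// Hypothetical sketch: restrict a Solr-style query to a path subtree.
class PathScopedQuery {

    // Characters treated as special in the Lucene/Solr query syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&;/";

    // Backslash-escape special characters and whitespace.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Prefix clause on the assumed "path" field. It matches descendants
    // of root, not the root resource itself.
    static String subtreeClause(String root) {
        String prefix = root.endsWith("/") ? root : root + "/";
        return "path:" + escape(prefix) + "*";
    }

    // Combine the caller's query with the subtree restriction.
    static String scope(String userQuery, String root) {
        return "(" + userQuery + ") AND " + subtreeClause(root);
    }

    public static void main(String[] args) {
        System.out.println(scope("title:sling", "/url1"));
    }
}
```

The same clause could instead be sent as a separate Solr filter query (the
fq parameter), which keeps it out of the relevance scoring.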



>
>
>
> Please write any feedback/thoughts you have after reading this vision –
> this’ll really help me to understand details further.
>
>
>

Sounds like you're getting there; please remember to submit a proposal before
the deadline if you're still interested.

Thanks
Ian



>
> Many thanks in advance,
>
> Ilya
>
