Re: GSoC 2013 - Apache Solr backend for Apache Sling

2013-05-03 Thread Ilya Velesevich
Hi,

Thanks for your time and assistance!

I've submitted my proposal for the "Apache Solr backend for Apache Sling" task
to the GSoC site.
Looking forward to taking part in the task and bringing value to the community!

Thanks,
Ilya


On Thu, May 2, 2013 at 11:02 AM, Bertrand Delacretaz bdelacre...@apache.org
 wrote:

 Hi,

 On Thu, May 2, 2013 at 8:44 AM, Ian Boston i...@tfd.co.uk wrote:
  On 2 May 2013 16:34, Ilya Velesevich ilya.velesev...@gmail.com wrote:
  ...about using DataImportHandler or
  ManifoldCF to provide search for Sling resources using Solr. Could you
  share some thoughts about this task? Or you probably think that this
 task
  should not be part of GSoC as it seems there could be not enough time to
  implement such support?...

  I think there might not be enough time.

 I tend to agree - it's perfectly fine to leave some components as
 optional in the project proposal, I'd rather have something well
 written and not full-featured than something ugly that ticks all
 boxes.

 -Bertrand



Re: GSoC 2013 - Apache Solr backend for Apache Sling

2013-05-02 Thread Ilya Velesevich
Hi Ian,

Many thanks for your reply!

Also one additional clarification about using DataImportHandler or
ManifoldCF to provide search for Sling resources using Solr. Could you
share some thoughts about this task? Or you probably think that this task
should not be part of GSoC as it seems there could be not enough time to
implement such support?

Thanks,
Ilya



Re: GSoC 2013 - Apache Solr backend for Apache Sling

2013-05-02 Thread Ian Boston
I think there might not be enough time.

Bertrand, WDYT?

Critical for project success, or an add-on?

Ian


On 2 May 2013 16:34, Ilya Velesevich ilya.velesev...@gmail.com wrote:

 Hi Ian,

 Many thanks for your reply!

 Also one additional clarification about using DataImportHandler or
 ManifoldCF to provide search for Sling resources using Solr. Could you
 share some thoughts about this task? Or you probably think that this task
 should not be part of GSoC as it seems there could be not enough time to
 implement such support?

 Thanks,
 Ilya



Re: GSoC 2013 - Apache Solr backend for Apache Sling

2013-05-02 Thread Bertrand Delacretaz
Hi,

On Thu, May 2, 2013 at 8:44 AM, Ian Boston i...@tfd.co.uk wrote:
 On 2 May 2013 16:34, Ilya Velesevich ilya.velesev...@gmail.com wrote:
 ...about using DataImportHandler or
 ManifoldCF to provide search for Sling resources using Solr. Could you
 share some thoughts about this task? Or you probably think that this task
 should not be part of GSoC as it seems there could be not enough time to
 implement such support?...

 I think there might not be enough time.

I tend to agree - it's perfectly fine to leave some components as
optional in the project proposal, I'd rather have something well
written and not full-featured than something ugly that ticks all
boxes.

-Bertrand


GSoC 2013 - Apache Solr backend for Apache Sling

2013-04-30 Thread Ilya Velesevich
Hi Everyone,

I'm working on a proposal for the “Apache Solr backend for Apache Sling” task as
part of Google Summer of Code 2013 –
https://issues.apache.org/jira/browse/SLING-2795. So far I have been reading
articles, watching videos and looking through source code to investigate the
topic in more depth. Now I want to describe my view of the task and my
implementation approach. All your comments and suggestions would be very
helpful for improving my proposal and for getting more value out of
implementing the task.

I see several parts of the task.

*1. Provide CRUDL operations for Solr data through the Sling API.*

This will allow creating Sling resources that reside in a Solr server and
querying them through the Sling API using Solr search capabilities. Solr query
syntax should be used for queries.

From the Sling API perspective, a custom *ResourceProvider* (and *Resource*)
implementation will be created, additionally implementing
*QueriableResourceProvider* and *ModifyingResourceProvider*. (If necessary, the
*RefreshableResourceProvider* and *DynamicResourceProvider* interfaces will
also be implemented.) The SolrJ API will be used to communicate with the Solr
server.

*2. Provide convenient ways to create Solr resources based on
different data.*

*2.1. Create a Solr resource based on an arbitrary Sling resource.* This
will allow adding Sling resources to the Solr server for efficient search. The
created Solr resource will also hold a reference (most likely the resource
path) to the original Sling resource. The *Adaptable* concept seems to be a
reasonable way of implementing this functionality: to “convert” an arbitrary
Sling resource to a Solr resource, and to resolve the original Sling resource
from the Solr resource.

Also, I think that not all metadata of a Sling resource should be used when
creating the corresponding Solr resource, so this task should also include
some configuration to specify which metadata is passed to the Solr
resource. Additionally, some transformations on resource metadata could be
supported here.
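
A minimal sketch of what such a configuration could look like, in plain Java.
Everything in it is an assumption made for illustration: the class name
SolrFieldMapping, the "sling_path" back-reference field and the property names
are hypothetical and are not part of the Sling or SolrJ APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper: copies only configured resource properties into Solr fields. */
class SolrFieldMapping {
    // Source property name -> target Solr field name (both illustrative).
    private final Map<String, String> fieldNames = new LinkedHashMap<>();
    // Optional per-property value transformation.
    private final Map<String, Function<Object, Object>> transforms = new LinkedHashMap<>();

    SolrFieldMapping include(String property, String solrField) {
        fieldNames.put(property, solrField);
        return this;
    }

    SolrFieldMapping transform(String property, Function<Object, Object> fn) {
        transforms.put(property, fn);
        return this;
    }

    /** Builds the field map for one document; the path is always kept as a back-reference. */
    Map<String, Object> toSolrFields(String resourcePath, Map<String, Object> properties) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("sling_path", resourcePath); // assumed field name
        for (Map.Entry<String, String> e : fieldNames.entrySet()) {
            Object value = properties.get(e.getKey());
            if (value == null) {
                continue; // property absent on this resource
            }
            Function<Object, Object> fn = transforms.get(e.getKey());
            fields.put(e.getValue(), fn == null ? value : fn.apply(value));
        }
        return fields;
    }
}
```

Properties that are not explicitly included never reach the index, which keeps
the whitelist as the single place where indexed metadata is decided.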

*2.2.* When creating Solr resources, not all data can be stored efficiently
in Solr – for instance, large binary files. In that case, one could create
a Sling resource (for instance, FileSystem or Jackrabbit) and then create
a Solr resource based on that Sling resource. This allows both efficient
search through Solr and effective storage options. As an optimization,
these steps could be done automatically based on some configuration. So
*when a Solr resource is created we could analyze it* (analyze metadata,
try to adapt it to certain types) *and create additional supporting
resources in other parts of the Sling virtual resource tree if necessary*.
What do you think: is it necessary to implement such functionality, or will
option 2.1 be sufficient? What useful scenarios do you see for this task
besides the “large binary” scenario?

*3. Provide a solution to support search for arbitrary Sling
resources through the Sling API using Solr capabilities.*

From my point of view, this one needs some external solution to support
things like full indexing, incremental indexing, different schedules,
etc. I see that Solr DataImportHandler or Apache ManifoldCF could be
utilized for this task. So the concept here would be to write the
necessary implementation so that the Sling virtual resource tree can be used
as a data source for one of the components mentioned above. What do you
think about this approach? Could you advise some alternatives to Solr
DataImportHandler and Apache ManifoldCF for implementing this task?
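
To make the difference between a full and an incremental index concrete, here
is a small plain-Java sketch. It is not how DataImportHandler or ManifoldCF
work internally; the IndexPlanner class and its Entry type are hypothetical,
and it assumes each resource exposes a last-modified timestamp.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical planner that decides which resource paths need (re)indexing. */
class IndexPlanner {
    /** Minimal stand-in for a resource: a path plus a last-modified time in millis. */
    static final class Entry {
        final String path;
        final long lastModified;

        Entry(String path, long lastModified) {
            this.path = path;
            this.lastModified = lastModified;
        }
    }

    /** Full index: every entry of the tree is sent to the index. */
    static List<String> fullIndex(List<Entry> tree) {
        List<String> toIndex = new ArrayList<>();
        for (Entry e : tree) {
            toIndex.add(e.path);
        }
        return toIndex;
    }

    /** Incremental index: only entries modified strictly after the previous run. */
    static List<String> incrementalIndex(List<Entry> tree, long lastRunMillis) {
        List<String> toIndex = new ArrayList<>();
        for (Entry e : tree) {
            if (e.lastModified > lastRunMillis) {
                toIndex.add(e.path);
            }
        }
        return toIndex;
    }
}
```

A scheduler would then call fullIndex rarely and incrementalIndex often,
remembering the time of the last successful run between calls.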



Also, I’ve got a couple of questions on the Sling API:

   - Am I right that the “best practice” way to provide a bundle with a custom
   *ResourceProvider* implementation is to use the Apache Felix Maven SCR Plugin
   and specify certain SCR annotations (like *@Component*, *@Service* and
   some others) on the corresponding classes (the *ResourceProvider* or
   *ResourceProviderFactory* implementation in this case)?


   - I see that *ResourceResolver* is intended to be used by clients to
   obtain and work with Sling resources. It also seems to me that a custom
   *ResourceResolver* implementation is unlikely to be necessary
   for the Solr integration task. But still, could you please name some
   typical valid cases in which one would need to create a custom
   *ResourceResolver*?


   - Suppose I have configured the same resource provider implementation (like
   the file system resource provider or a possible Solr resource provider) under two
   URLs, “/url1” and “/url2”. Now I want to perform *findResources*/
   *queryResources*, but only for the resources residing under “/url1”. Is it
   possible to limit search results in such a way? (Probably I missed something,
   but looking through the source code it seems that query results from all
   queriable resource providers supporting the given query language will be
   combined regardless of where in the resource tree the corresponding provider is
   configured.)
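
Whether the resolver can scope a query itself is the open question above.
Independently of the answer, results could be narrowed on the client side by
resource path. The sketch below is plain Java under that assumption; PathScope
is a hypothetical helper, not a Sling class, and it only needs the paths of the
returned resources.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical client-side filter that keeps only results below one root path. */
class PathScope {
    /** True if path is the root itself or lies below it; "/url1" does not match "/url10". */
    static boolean isUnder(String root, String path) {
        if (root.equals("/")) {
            return path.startsWith("/");
        }
        return path.equals(root) || path.startsWith(root + "/");
    }

    /** Returns the result paths under root, preserving their original order. */
    static List<String> restrictTo(String root, List<String> resultPaths) {
        List<String> kept = new ArrayList<>();
        for (String path : resultPaths) {
            if (isUnder(root, path)) {
                kept.add(path);
            }
        }
        return kept;
    }
}
```

Comparing against root + "/" rather than using a bare prefix check avoids
accidentally matching sibling mount points that share a name prefix.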



Please write any feedback/thoughts you have 

Re: GSoC 2013 - Apache Solr backend for Apache Sling

2013-04-30 Thread Ian Boston
Hi
Some comments in line,
but please remember to submit this proposal at the GSoC site so that it can
be reviewed.
The deadline is

3rd May 2013

I.e. this Friday.

Ian
(More below).


On 30 April 2013 19:15, Ilya Velesevich ilya.velesev...@gmail.com wrote:

 Hi Everyone,

 I'm working on a proposal for the “Apache Solr backend for Apache Sling” task as
 part of Google Summer of Code 2013 –
 https://issues.apache.org/jira/browse/SLING-2795. So far I have been reading
 articles, watching videos and looking through source code to investigate the
 topic in more depth. Now I want to describe my view of the task and my
 implementation approach. All your comments and suggestions would be very
 helpful for improving my proposal and for getting more value out of
 implementing the task.

 I see several parts of the task.

 *1. Provide CRUDL operations for Solr data through the Sling API.*

 This will allow creating Sling resources that reside in a Solr server and
 querying them through the Sling API using Solr search capabilities. Solr query
 syntax should be used for queries.

 From the Sling API perspective, a custom *ResourceProvider* (and *Resource*)
 implementation will be created, additionally implementing
 *QueriableResourceProvider* and *ModifyingResourceProvider*. (If necessary, the
 *RefreshableResourceProvider* and *DynamicResourceProvider* interfaces will
 also be implemented.) The SolrJ API will be used to communicate with the Solr
 server.



Yes (and you might want to think about running Solr embedded for dev
purposes).



 *2. Provide convenient ways to create Solr resources based on
 different data.*

 *2.1. Create a Solr resource based on an arbitrary Sling resource.* This
 will allow adding Sling resources to the Solr server for efficient search. The
 created Solr resource will also hold a reference (most likely the resource
 path) to the original Sling resource. The *Adaptable* concept seems to be a
 reasonable way of implementing this functionality: to “convert” an arbitrary
 Sling resource to a Solr resource, and to resolve the original Sling resource
 from the Solr resource.

 Also, I think that not all metadata of a Sling resource should be used when
 creating the corresponding Solr resource, so this task should also include
 some configuration to specify which metadata is passed to the Solr
 resource. Additionally, some transformations on resource metadata could be
 supported here.




I think you should initially focus on just getting or resolving Solr
resources using the ResourceResolver.

Later you can add creating those resources via the
ModifyingResourceProvider. If you think of a Resource as a map of
properties, then it fits the Solr document model reasonably well, i.e. a
Resource maps 1:1 to a Solr Document.
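
The 1:1 mapping can be sketched in plain Java with maps standing in for both
sides. This is only an illustration: ResourceDocumentMapping is a hypothetical
class, and using "id" as the unique-key field that carries the resource path is
an assumption about the Solr schema, not something stated in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical mapping between a resource (path + properties) and a flat document. */
class ResourceDocumentMapping {
    // Assumed unique-key field; a resource property named "id" would collide with it.
    static final String ID_FIELD = "id";

    /** Resource -> document: all properties become fields, the path becomes the key. */
    static Map<String, Object> toDocument(String path, Map<String, Object> properties) {
        Map<String, Object> doc = new LinkedHashMap<>(properties);
        doc.put(ID_FIELD, path);
        return doc;
    }

    /** Document -> resource path. */
    static String pathOf(Map<String, Object> doc) {
        return (String) doc.get(ID_FIELD);
    }

    /** Document -> resource properties (every field except the key). */
    static Map<String, Object> propertiesOf(Map<String, Object> doc) {
        Map<String, Object> props = new LinkedHashMap<>(doc);
        props.remove(ID_FIELD);
        return props;
    }
}
```

With SolrJ, the document side would presumably be a SolrInputDocument on write
and a SolrDocument on read instead of a plain map.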




 *2.2.* When creating Solr resources, not all data can be stored efficiently
 in Solr – for instance, large binary files. In that case, one could create
 a Sling resource (for instance, FileSystem or Jackrabbit) and then create
 a Solr resource based on that Sling resource. This allows both efficient
 search through Solr and effective storage options. As an optimization,
 these steps could be done automatically based on some configuration. So
 *when a Solr resource is created we could analyze it* (analyze metadata,
 try to adapt it to certain types) *and create additional supporting
 resources in other parts of the Sling virtual resource tree if necessary*.
 What do you think: is it necessary to implement such functionality, or will
 option 2.1 be sufficient? What useful scenarios do you see for this task
 besides the “large binary” scenario?



Resources may have properties that are streams. How the stream is stored
and delivered is an implementation detail of the ResourceProvider and the
object it provides. So a SolrResourceProvider might provide SolrResource
objects, which expose a SolrResourceDocument when
resource.adaptTo(SolrResourceDocument.class) is invoked.

The SolrResourceDocument might then have a getBodyStream() method.
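
A simplified plain-Java sketch of that idea follows. The names SolrResource,
SolrResourceDocument and getBodyStream() come from the paragraph above; their
shapes here are assumptions, and SolrResource is a stand-in that does not
implement the real Sling Resource interface. It only mirrors the adaptTo
contract of returning the requested view or null.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

/** Hypothetical document view of a Solr-backed resource. */
interface SolrResourceDocument {
    InputStream getBodyStream();
}

/** Simplified stand-in for a resource; the real Sling Resource interface has more methods. */
class SolrResource {
    private final String path;
    private final byte[] body;

    SolrResource(String path, byte[] body) {
        this.path = path;
        this.body = body;
    }

    String getPath() {
        return path;
    }

    /** Returns the requested view of this resource, or null if the type is unsupported. */
    <T> T adaptTo(Class<T> type) {
        if (type == SolrResourceDocument.class) {
            // How the bytes are stored stays hidden; each call yields a fresh stream.
            SolrResourceDocument doc = () -> new ByteArrayInputStream(body);
            return type.cast(doc);
        }
        return null;
    }
}
```

A caller would write resource.adaptTo(SolrResourceDocument.class) and check the
result for null before reading the stream.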



 *3. Provide a solution to support search for arbitrary Sling
 resources through the Sling API using Solr capabilities.*

 From my point of view, this one needs some external solution to support
 things like full indexing, incremental indexing, different schedules,
 etc. I see that Solr DataImportHandler or Apache ManifoldCF could be
 utilized for this task. So the concept here would be to write the
 necessary implementation so that the Sling virtual resource tree can be used
 as a data source for one of the components mentioned above. What do you
 think about this approach? Could you advise some alternatives to Solr
 DataImportHandler and Apache ManifoldCF for implementing this task?



 Also, I’ve got a couple of questions on the Sling API:

 - Am I right that the “best practice” way to provide a bundle with a custom
 *ResourceProvider* implementation is to use the Apache Felix Maven SCR Plugin
and specify certain SCR annotations (like *@Component*, *@Service* and
some others)