Re: Using Cassandra as Back End for publish

2014-09-04 Thread Michael Marth
Hi,

I think your best bet would be
http://jackrabbit.apache.org/oak/docs/nodestore/documentmk.html
as a general overview (even if it is skewed towards MongoDB), and then looking into
http://svn.apache.org/viewvc/jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document/
There you will find the Mongo implementation, as well as the upcoming implementation for relational DBs.
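
To make that extension point a bit more concrete, here is a rough sketch of
how a custom DocumentStore would be wired in. Everything Cassandra-specific
is hypothetical here: your implementation would implement the DocumentStore
interface from the package above, and the in-memory store is used only as a
stand-in so the snippet is self-contained. The builder API may also differ
slightly between Oak versions.

import javax.jcr.Repository;

import org.apache.jackrabbit.oak.Oak;
import org.apache.jackrabbit.oak.jcr.Jcr;
import org.apache.jackrabbit.oak.plugins.document.DocumentMK;
import org.apache.jackrabbit.oak.plugins.document.DocumentNodeStore;
import org.apache.jackrabbit.oak.plugins.document.DocumentStore;
import org.apache.jackrabbit.oak.plugins.document.memory.MemoryDocumentStore;

public class CustomDocumentStoreExample {

    public static void main(String[] args) throws Exception {
        // A Cassandra-backed store would implement the DocumentStore
        // interface, just like MongoDocumentStore and the RDB variant do.
        // MemoryDocumentStore is only a stand-in so this sketch runs
        // without any external database.
        DocumentStore store = new MemoryDocumentStore();

        // Plug the custom store into a DocumentNodeStore via the builder.
        DocumentNodeStore ns = new DocumentMK.Builder()
                .setDocumentStore(store)
                .getNodeStore();

        // Expose the repository over the JCR API; replication in Sling/AEM
        // sits on top of this layer and is unaffected by the store choice.
        Repository repository = new Jcr(new Oak(ns)).createRepository();
        System.out.println("Repository started: " + repository);

        ns.dispose();
    }
}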

Cheers
Michael


On 04 Sep 2014, at 17:18, Abhijit Mazumder  wrote:

> Hi Michael,
> I would love to. Currently we are designing with Mongo as the back end for
> author, with Scene 7 cloud as the image repository. For author, with its
> asset-heavy operations, Mongo is the automatic choice. However, for publish
> we now tend to think Cassandra would be better for user-generated content
> and linear scalability.
> I went through some of the documentation but could not find a "Getting
> Started" guide for a custom MK implementation. Could you point me to some
> relevant documentation that would help us get started?
> 
> Regards,
> Abhijit
> 
> 
> On Thu, Sep 4, 2014 at 4:14 PM, Michael Marth  wrote:
> 
>> Hi Abhijit,
>> 
>> I assume you refer to replication as implemented in Sling and AEM. Those
>> work on top of the JCR API, so they are independent of the Micro Kernel
>> implementation.
>> 
>> For running Oak on Cassandra you would need a specific MK implementation
>> (presumably based on the DocumentMK). Is that something you intend to work
>> on? (I am sure there would be a lot of interest in such an implementation.)
>> 
>> Best regards
>> Michael
>> 
>> On 04 Sep 2014, at 11:07, Abhijit Mazumder 
>> wrote:
>> 
>>> Hi,
>>> We are considering using Cassandra as the back end for the publish
>>> environment. In author we are using Mongo.
>>> What options do we have to customize the replication agent to achieve
>>> this?
>>> Regards,
>>> Abhijit
>> 
>> 



Re: Using Cassandra as Back End for publish

2014-09-04 Thread Abhijit Mazumder
Hi Michael,
I would love to. Currently we are designing with Mongo as the back end for
author, with Scene 7 cloud as the image repository. For author, with its
asset-heavy operations, Mongo is the automatic choice. However, for publish we
now tend to think Cassandra would be better for user-generated content and
linear scalability.
I went through some of the documentation but could not find a "Getting
Started" guide for a custom MK implementation. Could you point me to some
relevant documentation that would help us get started?

Regards,
Abhijit


On Thu, Sep 4, 2014 at 4:14 PM, Michael Marth  wrote:

> Hi Abhijit,
>
> I assume you refer to replication as implemented in Sling and AEM. Those
> work on top of the JCR API, so they are independent of the Micro Kernel
> implementation.
>
> For running Oak on Cassandra you would need a specific MK implementation
> (presumably based on the DocumentMK). Is that something you intend to work
> on? (I am sure there would be a lot of interest in such an implementation.)
>
> Best regards
> Michael
>
> On 04 Sep 2014, at 11:07, Abhijit Mazumder 
> wrote:
>
> > Hi,
> >  We are considering using Cassandra as the back end for the publish
> > environment. In author we are using Mongo.
> > What options do we have to customize the replication agent to achieve
> > this?
> > Regards,
> > Abhijit
>
>


Re: Using BlobStore by default with SegmentNodeStore

2014-09-04 Thread Davide Giannella
On 04/09/2014 12:25, Chetan Mehrotra wrote:
> ... (supermegacut!)
>
> Thoughts?
>
Since you mentioned AEM: the deployment based on JR2 already uses two
different directories, one for the repository/segments and one for the blobs.

Both AEM and JR2 already run separate tasks for cleaning up the blobs, IIRC.

So I'm in favour of having segment+blob as the default. My only concern is
the deployments already in place. We may need (if it's not there already) a
process/tool for "migrating" between the two scenarios.

Davide




Using BlobStore by default with SegmentNodeStore

2014-09-04 Thread Chetan Mehrotra
Hi Team,

Currently SegmentNodeStore does not use a BlobStore by default and
stores the binary data within the data tar files. This has the following
benefits:

1. Simpler backup - the user just needs to back up the segmentstore directory
2. No Blob GC - the RevisionGC also deletes the binary content, so a
separate Blob GC need not be performed
3. Faster IO - the binary content is fetched via memory-mapped files
and hence might have better performance compared to streamed IO

However, of late we are seeing issues where the repository is not able to
reclaim space from deleted binary content as part of normal cleanup, and
full-scale compaction needs to be performed to reclaim the space.
Running compaction has its own issues (see OAK-2045) and currently
needs to be performed offline to get optimum results.

In quite a few cases it has been seen that repository growth is mostly
due to Lucene index content changes, which lead to the creation of new
binary content and also cause fragmentation due to newer revisions.
Further, as the segment logic does not perform de-duplication, any change
in a Lucene index file would probably re-create the whole index file (need
to confirm).

Given that such repository growth is troublesome, it might be better if
we configure a BlobStore by default with SegmentNodeStore (or at least
for applications like AEM); a rough wiring sketch follows at the end of
this mail. This should reduce the rate of repository growth due to

1. De-duplication - the current BlobStore and DataStore implementations
perform de-duplication, so adding the same binary again would not cause
size growth

2. Less fragmentation - as large binary content would no longer be part of
the data tar files, Blob GC would be able to reclaim its space. Currently,
if even one bulk segment in a data tar file is still referenced, the
cleanup cannot remove that file; that space can
only be reclaimed via compaction.

Compared to the benefits mentioned initially:

1. Backup - the user needs to back up two folders
2. Blob GC - it would need to be run separately
3. Faster IO - that remains to be seen. For Lucene this can be mitigated
to an extent with the proposed CopyOnReadDirectory support in OAK-1724

Further, we also get the benefit of sharing the BlobStore between
multiple instances if required.
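
For illustration, here is a rough wiring sketch of segment + external
BlobStore. The paths are placeholders, and the constructors shown are the
Oak 1.0-era ones (newer versions use builders instead), so treat the exact
signatures as an assumption rather than a recipe.

import java.io.File;

import org.apache.jackrabbit.core.data.FileDataStore;
import org.apache.jackrabbit.oak.plugins.blob.datastore.DataStoreBlobStore;
import org.apache.jackrabbit.oak.plugins.segment.SegmentNodeStore;
import org.apache.jackrabbit.oak.plugins.segment.file.FileStore;
import org.apache.jackrabbit.oak.spi.blob.BlobStore;
import org.apache.jackrabbit.oak.spi.state.NodeStore;

public class SegmentWithBlobStoreExample {

    public static void main(String[] args) throws Exception {
        // Binaries go into a FileDataStore (content addressed, so adding
        // the same binary twice does not grow the store), wrapped as a
        // BlobStore for Oak.
        FileDataStore fds = new FileDataStore();
        fds.setMinRecordLength(4096);   // keep very small binaries inline
        fds.init("/path/to/repository/datastore");
        BlobStore blobStore = new DataStoreBlobStore(fds);

        // Node records stay in the segment tar files; large binaries are
        // only referenced from there, so Blob GC can reclaim their space
        // independently of segment cleanup/compaction.
        FileStore fileStore = new FileStore(
                blobStore,
                new File("/path/to/repository/segmentstore"),
                256,    // max tar file size in MB
                true);  // memory mapping

        NodeStore nodeStore = new SegmentNodeStore(fileStore);
        System.out.println("SegmentNodeStore with external BlobStore: " + nodeStore);

        fileStore.close();
    }
}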

Thoughts?

Chetan Mehrotra


Re: Using Cassandra as Back End for publish

2014-09-04 Thread Michael Marth
Hi Abhijit,

I assume you refer to replication as implemented in Sling and AEM. Those work 
on top of the JCR API, so they are independent of the Micro Kernel 
implementation.

For running Oak on Cassandra you would need a specific MK implementation
(presumably based on the DocumentMK). Is that something you intend to work on?
(I am sure there would be a lot of interest in such an implementation.)

Best regards
Michael

On 04 Sep 2014, at 11:07, Abhijit Mazumder  wrote:

> Hi,
>  We are considering using Cassandra as the back end for the publish
> environment. In author we are using Mongo.
> What options do we have to customize the replication agent to achieve
> this?
> Regards,
> Abhijit



Using Cassandra as Back End for publish

2014-09-04 Thread Abhijit Mazumder
Hi,
  We are considering using Cassandra as the back end for the publish
environment. In author we are using Mongo.
 What options do we have to customize the replication agent to achieve
this?
Regards,
Abhijit