Re: Test failures in o.a.j.o.spi.whiteboard.WhiteboardUtilsTest

2015-11-24 Thread Chetan Mehrotra
On Tue, Nov 24, 2015 at 4:36 PM, Francesco Mari
<mari.france...@gmail.com> wrote:
> Maven home: /usr/local/Cellar/maven32/3.2.5/libexec
> Java version: 1.8.0_65, vendor: Oracle Corporation

I am on JDK 1.7.0_55 and there it passes. I will try it on JDK 8

Chetan Mehrotra


Re: Test failures in o.a.j.o.spi.whiteboard.WhiteboardUtilsTest

2015-11-24 Thread Chetan Mehrotra
Looks like on JDK 8 the MBean interface has to be public for MBean
registration to work. Done that with r1716110.
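For reference, a minimal sketch of the JDK 8 constraint (hypothetical names,
not the Oak code):

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// The management interface must be public; on JDK 8 registerMBean
// otherwise fails with NotCompliantMBeanException.
public interface ExampleMBean {
    String getName();
}

class Example implements ExampleMBean {
    public String getName() {
        return "example";
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        server.registerMBean(new Example(), new ObjectName("org.example:type=Example"));
    }
}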

@Francesco - Can you try again with updated trunk?
Chetan Mehrotra


On Tue, Nov 24, 2015 at 4:58 PM, Chetan Mehrotra
<chetan.mehro...@gmail.com> wrote:
> On Tue, Nov 24, 2015 at 4:36 PM, Francesco Mari
> <mari.france...@gmail.com> wrote:
>> Maven home: /usr/local/Cellar/maven32/3.2.5/libexec
>> Java version: 1.8.0_65, vendor: Oracle Corporation
>
> I am on JDK 1.7.0_55 and there it passes. I will try it on JDK 8
>
> Chetan Mehrotra


Re: Threading Question

2015-11-17 Thread Chetan Mehrotra
Have a look at webapp example [1] for suggested setup. The repository
should be created once and then reused.

Chetan Mehrotra
[1] https://github.com/apache/jackrabbit-oak/tree/trunk/oak-examples/webapp
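A rough sketch of that pattern (the NodeStore construction details here are an
assumption; see the webapp example for the real setup):

import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import org.apache.jackrabbit.oak.Oak;
import org.apache.jackrabbit.oak.jcr.Jcr;

// One Repository per application, one short-lived Session per request.
public class RepositoryHolder {
    // created once, e.g. in a ServletContextListener, and reused
    private static final Repository REPOSITORY = new Jcr(new Oak()).createRepository();

    public static String readProperty(String nodePath, String propName) throws Exception {
        Session session = REPOSITORY.login(
                new SimpleCredentials("admin", "admin".toCharArray()));
        try {
            return session.getNode(nodePath).getProperty(propName).getString();
        } finally {
            session.logout(); // sessions must not be shared across threads
        }
    }
}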


On Wed, Nov 18, 2015 at 4:02 AM, David Marginian
<da...@davidmarginian.com> wrote:
> https://jackrabbit.apache.org/oak/docs/construct.html
>
> In a threaded environment (servlet, etc.) is it ok/recommended to create the
> repository/nodestore once and store as instance variables and then use the
> repository to create a new session per request?  I know that sessions should
> not be shared across threads but I wasn't sure about repositories.
>
> Thanks!


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-06-01 Thread Chetan Mehrotra
I have started a new mail thread around "Usecases around Binary handling in
Oak" so as to first collect the kind of usecases we need to support. Once
we decide that, we can discuss the possible solutions.

So let's continue the discussion on that thread

Chetan Mehrotra

On Tue, May 17, 2016 at 12:31 PM, Angela Schreiber <anch...@adobe.com>
wrote:

> Hi Oak-Devs
>
> Just for the record: This topic has been discussed in an Adobe
> internal Oak-coordination call last Wednesday.
>
> Michael Marth first provided some background information and
> we discussed the various concerns mentioned in this thread
> and tried to identify the core issue(s).
>
> Marcel, Michael Duerig and Thomas proposed alternative approaches
> on how to address the original issues that lead to the API
> proposal, which all would avoid leaking out information about
> the internal blob handling.
>
> Unfortunately we ran out of time and didn't conclude the call
> with an agreement on how to proceed.
>
> From my perception the concerns raised here could not be resolved
> by the additional information.
>
> I would suggest that we try to continue the discussion here
> on the list. Maybe with a summary of the alternative proposals?
>
> Kind regards
> Angela
>
> On 11/05/16 15:38, "Ian Boston" <i...@tfd.co.uk> wrote:
>
> >Hi,
> >
> >On 11 May 2016 at 14:21, Marius Petria <mpet...@adobe.com> wrote:
> >
> >> Hi,
> >>
> >> I would add another use case in the same area, even if it is more
> >> problematic from the point of view of security. To better support load
> >> spikes an application could return 302 redirects to (signed) S3 urls such
> >> that binaries are fetched directly from S3.
> >>
> >
> >Perhaps that question exposes the underlying requirement for some
> >downstream users.
> >
> >This is a question, not a statement:
> >
> >If the application using Oak exposed a RESTful API that had all the same
> >functionality as [1], and was able to perform at the scale of S3, and had
> >the same security semantics as Oak, would applications that are needing
> >direct access to S3 or a File based datastore be able to use that API in
> >preference ?
> >
> >Is this really about issues with scalability and performance rather than a
> >fundamental need to drill deep into the internals of Oak ? If so,
> >shouldn't
> >the scalability and performance be fixed? (assuming it's a real concern)
> >
> >
> >
> >
> >>
> >> (if this can already be done or you think is not really related to the
> >> other two please disregard).
> >>
> >
> >AFAIK this is not possible at the moment. If it was deployments could use
> >nginX X-SendFile and other request offloading mechanisms.
> >
> >Best Regards
> >Ian
> >
> >
> >1 http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectOps.html
> >
> >
> >>
> >> Marius
> >>
> >>
> >>
> >> On 5/11/16, 1:41 PM, "Angela Schreiber" <anch...@adobe.com> wrote:
> >>
> >> >Hi Chetan
> >> >
> >> >IMHO your original mail didn't write down the fundamental analysis
> >> >but instead presented the solution. For each of the 2 cases I was
> >> >lacking the information _why_ this is needed.
> >> >
> >> >Both have been answered in private conversations only (1 today in
> >> >the oak call and 2 in a private discussion with tom). And
> >> >having heard them didn't make me more confident that the solution
> >> >you propose is the right thing to do.
> >> >
> >> >Kind regards
> >> >Angela
> >> >
> >> >On 11/05/16 12:17, "Chetan Mehrotra" <chetan.mehro...@gmail.com>
> wrote:
> >> >
> >> >>Hi Angela,
> >> >>
> >> >>On Tue, May 10, 2016 at 9:49 PM, Angela Schreiber <anch...@adobe.com>
> >> >>wrote:
> >> >>
> >> >>> Quite frankly I would very much appreciate if you took the time to collect
> >> >>> and write down the required (i.e. currently known and expected)
> >> >>> functionality.
> >> >>>
> >> >>> Then look at the requirements and look what is wrong with the current
> >> >>> API that we can't meet those requirements:
> >> >>> - is it just missing API extensions that can be added with moderate
> >> >>> effort?

Re: svn commit: r1724598 - in /jackrabbit/oak/trunk/oak-core/src: main/java/org/apache/jackrabbit/oak/api/ main/java/org/apache/jackrabbit/oak/plugins/document/rdb/ main/java/org/apache/jackrabbit/oak

2016-01-14 Thread Chetan Mehrotra
On Thu, Jan 14, 2016 at 6:40 PM,  <resc...@apache.org> wrote:
> 
> jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/api/Blob.java
> 
> jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document/rdb/RDBDocumentStore.java
> 
> jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/value/BinaryImpl.java

I see some changes to Blob/BinaryImpl. Are those changes related to
this issue? Most likely they are just noise but I wanted to confirm

Chetan Mehrotra


Re: svn commit: r1725250 - in /jackrabbit/oak/trunk: oak-core/src/main/java/org/apache/jackrabbit/oak/ oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/atomic/ oak-core/src/test/java/org/apach

2016-01-18 Thread Chetan Mehrotra
Hi Davide,

On Mon, Jan 18, 2016 at 5:46 PM,  <dav...@apache.org> wrote:
> + */
> +public AtomicCounterEditorProvider() {
> +    clusterSupplier = new Supplier<Clusterable>() {
> +        @Override
> +        public Clusterable get() {
> +            return cluster.get();
> +        }
> +    };
> +    schedulerSupplier = new Supplier<ScheduledExecutorService>() {
> +        @Override
> +        public ScheduledExecutorService get() {
> +            return scheduler.get();
> +        }
> +    };
> +    storeSupplier = new Supplier<NodeStore>() {
> +        @Override
> +        public NodeStore get() {
> +            return store.get();
> +        }
> +    };
> +    wbSupplier = new Supplier<Whiteboard>() {
> +        @Override
> +        public Whiteboard get() {
> +            return whiteboard.get();
> +        }
> +    };
> +}

Just curious about the use of the above approach. Is it for keeping the
dependencies non-static or for using final instance variables? If you
mark the references as static then all those bind and unbind methods would
not be required, as by the time the component is active the dependencies
would be set.


Chetan Mehrotra


Re: [Oak origin/1.4] Apache Jackrabbit Oak matrix - Build # 992 - Still Failing

2016-06-28 Thread Chetan Mehrotra
Thanks for the link. Will follow up on the issue and have it fixed in the branches
Chetan Mehrotra


On Mon, Jun 27, 2016 at 5:11 PM, Julian Reschke <julian.resc...@gmx.de> wrote:
> On 2016-06-27 13:31, Chetan Mehrotra wrote:
>>
>> On Sat, Jun 25, 2016 at 10:24 AM, Apache Jenkins Server
>> <jenk...@builds.apache.org> wrote:
>>>
>>> Caused by: java.lang.IllegalArgumentException: No enum constant
>>> org.apache.jackrabbit.oak.commons.FixturesHelper.Fixture.SEGMENT_TAR
>>> at java.lang.Enum.valueOf(Enum.java:238)
>>> at
>>> org.apache.jackrabbit.oak.commons.FixturesHelper$Fixture.valueOf(FixturesHelper.java:45)
>>> at
>>> org.apache.jackrabbit.oak.commons.FixturesHelper.<clinit>(FixturesHelper.java:58)
>>
>>
>> The tests are failing due to the above issue. Is this related to the presence
>> of the new segment-tar module in trunk but not in the branch?
>>
>> Chetan Mehrotra
>
>
> -> <https://issues.apache.org/jira/browse/OAK-4475>


Re: [VOTE] Release Apache Jackrabbit Oak 1.4.4

2016-06-27 Thread Chetan Mehrotra
On Mon, Jun 27, 2016 at 10:43 AM, Amit Jain <am...@apache.org> wrote:
[X] +1 Release this package as Apache Jackrabbit Oak 1.4.4

Chetan Mehrotra


Re: svn commit: r1750601 - in /jackrabbit/oak/trunk: oak-segment-tar/ oak-segment-tar/src/main/java/org/apache/jackrabbit/oak/segment/ oak-segment-tar/src/test/java/org/apache/jackrabbit/oak/segment/

2016-06-29 Thread Chetan Mehrotra
On Wed, Jun 29, 2016 at 1:25 PM, Francesco Mari
<mari.france...@gmail.com> wrote:
> oak-segment-tar should be releasable at any time. If I had to launch a
> quick patch release this morning, I would have to either revert your commit
> or postpone my release until Oak is released.

Given the current release frequency on trunk (2 weeks) I do not think
it should be a big problem, and holding off commits breaks the continuity
and increases work. But then that might be just an issue for me!

For now I have reverted the changes from oak-segment-tar

Chetan Mehrotra


Re: svn commit: r1750601 - in /jackrabbit/oak/trunk: oak-segment-tar/ oak-segment-tar/src/main/java/org/apache/jackrabbit/oak/segment/ oak-segment-tar/src/test/java/org/apache/jackrabbit/oak/segment/

2016-06-29 Thread Chetan Mehrotra
Hi Francesco,

On Wed, Jun 29, 2016 at 12:49 PM, Francesco Mari
<mari.france...@gmail.com> wrote:
> Please do not change the "oak.version" property to a snapshot version. If
> your change relies on code that is only available in the latest snapshot of
> Oak, please revert this commit and hold it back until a proper release of
> Oak is performed.

I can do that but I want to understand the impact here if we switch to a
SNAPSHOT version.

For example, in the past we made some changes in Jackrabbit which were
needed in Oak; we then switched to a snapshot version of JR2 and later
reverted to the released version once the JR2 release was done. That has
worked fine so far and we did not have to hold the feature work for it. So
I want to understand why it should be different here

Chetan Mehrotra


Re: svn commit: r1728341 - /jackrabbit/oak/trunk/oak-segment/src/main/java/org/apache/jackrabbit/oak/plugins/segment/SegmentGraph.java

2016-02-05 Thread Chetan Mehrotra
On Fri, Feb 5, 2016 at 2:54 PM, Michael Dürig <mdue...@apache.org> wrote:
> There's always another library ;-)

For utility stuff, well, almost!

Chetan Mehrotra


Re: svn commit: r1727311 - in /jackrabbit/oak/trunk/oak-core/src: main/java/org/apache/jackrabbit/oak/osgi/OsgiWhiteboard.java test/java/org/apache/jackrabbit/oak/osgi/OsgiWhiteboardTest.java

2016-01-29 Thread Chetan Mehrotra
On Fri, Jan 29, 2016 at 4:08 PM, Michael Dürig <mdue...@apache.org> wrote:
>
> Shouldn't we make this volatile?

Ack. Would do that

Chetan Mehrotra


Re: svn commit: r1728341 - /jackrabbit/oak/trunk/oak-segment/src/main/java/org/apache/jackrabbit/oak/plugins/segment/SegmentGraph.java

2016-02-03 Thread Chetan Mehrotra
On Wed, Feb 3, 2016 at 10:17 PM,  <mdue...@apache.org> wrote:
> +private static String toString(Throwable e) {
> +    StringWriter sw = new StringWriter();
> +    PrintWriter pw = new PrintWriter(sw, true);
> +    try {
> +        e.printStackTrace(pw);
> +        return sw.toString();
> +    } finally {
> +        pw.close();
> +    }
> +}
>  }
> +

Maybe use com.google.common.base.Throwables#getStackTraceAsString
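
For illustration, the Guava equivalent of the hand-rolled version above:

import com.google.common.base.Throwables;

class StackTraces {
    // replaces the StringWriter/PrintWriter boilerplate
    static String toString(Throwable e) {
        return Throwables.getStackTraceAsString(e);
    }
}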


Chetan Mehrotra


Re: testing blob equality

2016-02-29 Thread Chetan Mehrotra
On Mon, Feb 29, 2016 at 6:42 PM, Tomek Rekawek <reka...@adobe.com> wrote:
> I wonder if we can switch the order of length and identity comparison in 
> AbstractBlob#equal() method. Is there any case in which the 
> getContentIdentity() method will be slower than length()?

That can be switched but I am afraid that it would not work as
expected. In JackrabbitNodeState#createBlob determining the
contentIdentity involves determining the length. You can give
org.apache.jackrabbit.oak.upgrade.blob.LengthCachingDataStore a try
(See OAK-2882 for details)
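
For illustration, a paraphrase (not the actual AbstractBlob source) of the
switched order being discussed - identity first, then length, then streams:

import java.io.InputStream;
import org.apache.jackrabbit.oak.api.Blob;

final class BlobEquality {
    static boolean equal(Blob a, Blob b) throws Exception {
        String idA = a.getContentIdentity();
        if (idA != null && idA.equals(b.getContentIdentity())) {
            return true;  // cheap positive match
        }
        if (a.length() != b.length()) {
            return false; // cheap negative match
        }
        try (InputStream s1 = a.getNewStream(); InputStream s2 = b.getNewStream()) {
            int c1, c2;
            do {
                c1 = s1.read();
                c2 = s2.read();
                if (c1 != c2) {
                    return false;
                }
            } while (c1 != -1);
            return true;
        }
    }
}

As noted above, the catch is that getContentIdentity() itself may have to
determine the length, so the reordering may not buy anything.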

Chetan Mehrotra


Re: R: info about jackrabbitoak.

2016-02-24 Thread Chetan Mehrotra
On Wed, Feb 24, 2016 at 2:46 PM, Ancona Francesco
<francesco.anc...@siav.it> wrote:
> that the project depends on felix (osgi) dependency.

It does not depend on the Felix framework, only on some modules from the
Felix project. There is a webapp example [1] where you can deploy the war
on Tomcat/WebContainer and have the code in the war access the repository
instance

Chetan Mehrotra
[1] https://github.com/apache/jackrabbit-oak/tree/trunk/oak-examples/webapp


Re: Issue using the text extraction with lucene

2016-01-23 Thread Chetan Mehrotra
On Sat, Jan 23, 2016 at 9:34 PM, Stephan Becker
<stephan.bec...@netcentric.biz> wrote:
> Exception in thread "main" java.lang.NoSuchMethodError:
> org.apache.commons.csv.CSVFormat.withIgnoreSurroundingSpaces()Lorg/apache/commons/csv/CSVFormat;

Looks like tika-app-1.11 is using commons-csv 1.0 [1] while Oak uses
1.1, and CSVFormat.withIgnoreSurroundingSpaces was added in v1.1. We
tested it earlier with Tika 1.6. So you can try adding the commons-csv jar
as the first one on the classpath

java -cp commons-csv-1.1.jar:tika-app-1.11.jar:oak-run-1.2.4.jar

Chetan Mehrotra
[1]http://svn.apache.org/viewvc/tika/tags/1.11-rc1/tika-parsers/pom.xml?view=markup#l328


Re: Issue using the text extraction with lucene

2016-01-24 Thread Chetan Mehrotra
On Sun, Jan 24, 2016 at 2:28 AM, Stephan Becker
<stephan.bec...@netcentric.biz> wrote:
> How does it then further extract the
> text from added documents?

Currently the extracted text support does not allow updates, i.e. it
only has the text extracted at the time the extraction was done via the
tool. Text extracted later would not be added. The primary aim was to
speed up indexing time during migration.

Chetan Mehrotra


Re: Restructure docs

2016-01-20 Thread Chetan Mehrotra
On Wed, Jan 20, 2016 at 2:46 PM, Davide Giannella <dav...@apache.org> wrote:
> When you change/add/remove an item from the left-hand menu, you'll have
> to redeploy the whole site as it will be hardcoded within the html of
> each page. Deploying the whole website is a long process. Therefore
> limiting the changes over there make things faster.

I mostly do partial commits, i.e. only the modified page, and that has
worked well. Changing the left side menu is not a very frequent task,
and for that I think doing a full deploy of the site is fine for now

Chetan Mehrotra


Re: JUnit tests with FileDataStore

2016-01-27 Thread Chetan Mehrotra
To make use of FileDataStore you would need to configure a
SegmentNodeStore, as MemoryNodeStore does not allow plugging in a custom
BlobStore.

Have a look at snippet [1] for a possible approach

Chetan Mehrotra
[1] https://gist.github.com/chetanmeh/6242d0a7fe421955d456
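A minimal sketch of such a setup, written against the later oak-segment-tar
API (an assumption; the gist above targets the older oak-segment module):

import java.io.File;
import org.apache.jackrabbit.core.data.FileDataStore;
import org.apache.jackrabbit.oak.plugins.blob.datastore.DataStoreBlobStore;
import org.apache.jackrabbit.oak.segment.SegmentNodeStoreBuilders;
import org.apache.jackrabbit.oak.segment.file.FileStore;
import org.apache.jackrabbit.oak.segment.file.FileStoreBuilder;
import org.apache.jackrabbit.oak.spi.state.NodeStore;

public class FileDataStoreFixture {
    public static NodeStore create(File repoHome) throws Exception {
        // FileDataStore keeps binaries as content-addressed files on disk
        FileDataStore fds = new FileDataStore();
        fds.init(repoHome.getAbsolutePath());

        // SegmentNodeStore delegating binaries to the data store
        FileStore fileStore = FileStoreBuilder
                .fileStoreBuilder(new File(repoHome, "segmentstore"))
                .withBlobStore(new DataStoreBlobStore(fds))
                .build();
        return SegmentNodeStoreBuilders.builder(fileStore).build();
    }
}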


On Wed, Jan 27, 2016 at 6:42 AM, Tobias Bocanegra <tri...@apache.org> wrote:
> Hi,
>
> I have some tests in filevault that I want to run with the
> FileDataStore, but I couldn't figure out how to setup the repository
> correctly here [0]. I also looked at the tests in oak, but I couldn't
> find a valid reference.
>
> The reason for this is to test the binary references, which afaik only
> work with the FileDataStore.
> at least my test [1] works with jackrabbit, but not for oak.
>
> thanks.
> regards, toby
>
> [0] 
> https://github.com/apache/jackrabbit-filevault/blob/trunk/vault-core/src/test/java/org/apache/jackrabbit/vault/packaging/integration/IntegrationTestBase.java#L118-L120
> [1] 
> https://github.com/apache/jackrabbit-filevault/blob/trunk/vault-core/src/test/java/org/apache/jackrabbit/vault/packaging/integration/TestBinarylessExport.java


Re: parent pom env.OAK_INTEGRATION_TESTING

2016-03-22 Thread Chetan Mehrotra
On Tue, Mar 22, 2016 at 9:49 PM, Davide Giannella <dav...@apache.org> wrote:
> I can't really recall why and if we use this.

It's referred to in the main README.md to allow a developer to always
enable running of the integration tests

Chetan Mehrotra


Re: oak-resilience

2016-03-07 Thread Chetan Mehrotra
Cool stuff Tomek! This was something which was discussed in the last
Oakathon, so it's great to have a way to do resilience testing
programmatically. Will give it a try
Chetan Mehrotra


On Mon, Mar 7, 2016 at 1:49 PM, Stefan Egli <stefane...@apache.org> wrote:
> Hi Tomek,
>
> Would also be interesting to see the effect on the leases and thus
> discovery-lite under high memory load and network problems.
>
> Cheers,
> Stefan
>
> On 04/03/16 11:13, "Tomek Rekawek" <reka...@adobe.com> wrote:
>
>>Hello,
>>
>>For some time I've worked on a little project called oak-resilience. It
>>aims to be a resilience testing framework for the Oak. It uses
>>virtualisation to run Java code in a controlled environment, that can be
>>spoilt in different ways, by:
>>
>>* resetting the machine,
>>* filling the JVM memory,
>>* filling the disk,
>>* breaking or deteriorating the network.
>>
>>I described currently supported features in the README file [1].
>>
>>Now, once I have a hammer I'm looking for a nail. Could you share your
>>thoughts on areas/features in Oak which may benefit from being
>>systematically tested for the resilience in the way described above?
>>
>>Best regards,
>>Tomek
>>
>>[1]
>>https://github.com/trekawek/jackrabbit-oak/tree/resilience/oak-resilience
>>
>>--
>>Tomek Rękawek | Adobe Research | www.adobe.com
>>reka...@adobe.com
>>
>
>


Re: svn commit: r1737349 - /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document/rdb/RDBConnectionHandler.java

2016-04-01 Thread Chetan Mehrotra
On Fri, Apr 1, 2016 at 6:40 PM, Julian Reschke <julian.resc...@gmx.de> wrote:
> Did you benchmark System.currentTimeMillis() as opposed to checking the log
> level?

Well, the time taken by a single isDebugEnabled would always be less than
System.currentTimeMillis() + isDebugEnabled! In this case it anyway
does not matter much, as the remote call would have much more overhead.

The suggestion here was more about having a consistent way of doing such
things, not a hard requirement per se ...

Chetan Mehrotra


Re: svn commit: r1737349 - /jackrabbit/oak/trunk/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document/rdb/RDBConnectionHandler.java

2016-04-01 Thread Chetan Mehrotra
Hi Julian,

On Fri, Apr 1, 2016 at 5:19 PM,  <resc...@apache.org> wrote:
> +@Nonnull
> +private Connection getConnection() throws IllegalStateException, SQLException {
> +    long ts = System.currentTimeMillis();
> +    Connection c = getDataSource().getConnection();
> +    if (LOG.isDebugEnabled()) {
> +        long elapsed = System.currentTimeMillis() - ts;
> +        if (elapsed >= 100) {
> +            LOG.debug("Obtaining a new connection from " + this.ds + " took " + elapsed + "ms");
> +        }
> +    }
> +    return c;
> +}

You can also use PerfLogger here which is also used in other places in
DocumentNodeStore

---
final PerfLogger PERFLOG = new PerfLogger(
        LoggerFactory.getLogger(DocumentNodeStore.class.getName() + ".perf"));

final long start = PERFLOG.start();
Connection c = getDataSource().getConnection();
PERFLOG.end(start, 100, "Obtaining a new connection from {}", ds);
---

This would also avoid the call to System.currentTimeMillis() if debug
log is not enabled

Chetan Mehrotra


Re: [VOTE] Please vote for the final name of oak-segment-next

2016-04-26 Thread Chetan Mehrotra
Missed sending a nomination on the earlier thread. If it's not too late,
then one more proposal:

oak-segment-v2

This is somewhat similar to names used in Mongo mmapv1 and mmapv2.

Chetan Mehrotra

On Tue, Apr 26, 2016 at 2:32 PM, Tommaso Teofili <tommaso.teof...@gmail.com>
wrote:

> oak-segment-store +1
>
> Regards,
> Tommaso
>
> Il giorno lun 25 apr 2016 alle ore 16:52 Vikas Saurabh <
> vikas.saur...@gmail.com> ha scritto:
>
> > > oak-embedded-store +1
> >
> >
> > Thanks,
> > Vikas
> >
>


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-09 Thread Chetan Mehrotra
To highlight - as mentioned earlier, the user of the proposed API is tying
itself to implementation details of Oak, and if these change later then that
code would also need to be changed. Or as Ian summed it up

> if the API is introduced it should create an out of band agreement with
> the consumers of the API to act responsibly.

The method is to be used for those important cases where you do rely on
implementation details to get optimal performance in very specific
scenarios. It's like DocumentNodeStore making use of some Mongo specific API
to perform some critical operation to achieve better performance
after checking that the underlying DocumentStore is Mongo based.

I have seen the discussion of JCR-3534 and other related issues but still do
not see any conclusion on how to answer such queries where direct access to
blobs is required for performance reasons. This issue is not about exposing
the blob reference for remote access but more about an optimal path for in-VM
access

> who owns the resource? Who coordinates (concurrent) access to it and how?
> What are the correctness and performance implications here (races,
> deadlock, corruptions, JCR semantics)?

The client code would need to be implemented in a proper way. It's more like
implementing a CommitHook: if implemented in an incorrect way it would cause
issues, deadlocks etc. But then we assume that anyone implementing that
interface would take proper care in the implementation.

>  it limits implementation freedom and hinders further evolution
> (chunking, de-duplication, content based addressing, compression, gc, etc.)
> for data stores.

As mentioned earlier, some parts of an API indicate a closer dependency on how
things work (like an SPI, or a ConsumerType API in OSGi terms). By using such an
API client code definitely ties itself to Oak implementation details, but it
should not limit how the Oak implementation details evolve. So when they change,
client code needs to adapt itself accordingly. Oak can express that
by incrementing the minor version of the exported package to indicate the change
in behavior.

> bypassing JCR's security model

I do not yet see the attack vector which we need to defend differently
here. Again, the blob URL is not being exposed, say, as part of WebDAV or any
other remote call. So I would like to understand the security concern better
here (unless it is defending against malicious or badly implemented client
code, which we discussed above)

> Can't we come up with an API that allows the blobs to stay under control
> of Oak?

The code needs to work either at the OS level, say a file handle, or say an S3
object. So I do not see a way where it can work without having access to those
details

FWIW there is code out there which reverse engineers the blobId to access
the actual binary. People do it so as to get decent throughput in image
rendition logic for large scale deployments. The proposal here was to
formalize that approach by providing a proper API. If we do not provide
such an API then the only way for them would be to continue relying on
reverse engineering the blobId!

> If not, this is probably an indication that those blobs shouldn't go into
> Oak but just references to them, as Francesco already proposed. Anything else
> is neither fish nor fowl: you can't have the JCR goodies but at the same
> time access underlying resources at will.

That's a fine argument to make. But the users here have real problems to
solve which we should not ignore. Oak based systems are being proposed for
large asset deployments where one of the primary requirements is asset
handling/processing of 100s of TB of binary data. So we would then have to
recommend for such cases to not use the JCR Binary abstraction and manage the
binaries on your own. That would then solve both problems (though it might
break lots of tooling built on top of the JCR API to manage those
binaries)!

Thinking more - another approach that I can suggest is that people
implement their own BlobStore (maybe by extending ours) and provide this
API there, i.e. one which takes a blob id and provides the required details.
This way we "outsource" the problem. Would that be acceptable?

Chetan Mehrotra

On Mon, May 9, 2016 at 2:28 PM, Michael Dürig <mdue...@apache.org> wrote:

>
> Hi,
>
> I very much share Francesco's concerns here. Unconditionally exposing
> access to operation system resources underlying Oak's inner working is
> troublesome for various reasons:
>
> - who owns the resource? Who coordinates (concurrent) access to it and
> how? What are the correctness and performance implications here (races,
> deadlock, corruptions, JCR semantics)?
>
> - it limits implementation freedom and hinders further evolution
> (chunking, de-duplication, content based addressing, compression, gc, etc.)
> for data stores.
>
> - bypassing JCR's security model
>
> Pretty much all of this has been discussed in the scope of
> https://issues.apache.org/jira/browse/JCR-3534 and
> https://is

Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-09 Thread Chetan Mehrotra
Had an offline discussion with Michael on this and explained the usecase
requirements in more detail. One concern that has been raised is that such
a generic adaptTo API is too inviting for improper use, and Oak does not
have any context around when this URL is exposed or for how long it is used.

So instead of having a generic adaptTo API at the JCR level we can have a
BlobProcessor callback (Approach #B). Below is more of a strawman proposal.
Once we have consensus then we can go over the details

interface BlobProcessor {
    void process(AdaptableBlob blob);
}

Where AdaptableBlob is

public interface AdaptableBlob {
    <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
}

The BlobProcessor instance can be passed via the BlobStore API. So the client
would look up a BlobStore service (i.e. use the Oak level API) and pass it the
ContentIdentity of the JCR Binary aka blobId

interface BlobStore {
    void process(String blobId, BlobProcessor processor);
}

The approach ensures

1. That any blob handle exposed is only guaranteed for the duration
of the 'process' invocation
2. There is no guarantee on the utility of the blob handle (File, S3 object)
beyond the callback. So one should not keep the passed File handle for
later use
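
A sketch of how client code might drive this (it compiles only against the
strawman interfaces above; runNativeConverter is a made-up hook):

public class RenditionTrigger {
    public void render(BlobStore blobStore, String blobId) {
        blobStore.process(blobId, new BlobProcessor() {
            @Override
            public void process(AdaptableBlob blob) {
                java.io.File file = blob.adaptTo(java.io.File.class);
                if (file != null) {
                    // hand the path to a native tool; the handle is only
                    // valid until this callback returns
                    runNativeConverter(file.getAbsolutePath());
                }
            }
        });
    }

    private void runNativeConverter(String path) {
        // exec the external rendition tool here
    }
}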

Hopefully this should address some of the concerns raised in this thread.
Looking forward to feedback :)

Chetan Mehrotra

On Mon, May 9, 2016 at 6:24 PM, Michael Dürig <mdue...@apache.org> wrote:

>
>
> On 9.5.16 11:43 , Chetan Mehrotra wrote:
>
>> To highlight - As mentioned earlier the user of proposed api is tying
>> itself to implementation details of Oak and if this changes later then
>> that
>> code would also need to be changed. Or as Ian summed it up
>>
>> if the API is introduced it should create an out of band agreement with
>>>
>> the consumers of the API to act responsibly.
>>
>
> So what does "to act responsibly" actually means? Are we even in a
> position to precisely specify this? Experience tells me that we only find
> out about those semantics after the fact when dealing with painful and
> expensive customer escalations.
>
> And even if we could, it would tie Oak into very tight constraints on how
> it has to behave and how not. Constraints that would turn out prohibitively
> expensive for future evolution. Furthermore a huge amount of resources
> would be required to formalise such constraints via test coverage to guard
> against regressions.
>
>
>
>> The method is to be used for those important case where you do rely on
>> implementation detail to get optimal performance in very specific
>> scenarios. Its like DocumentNodeStore making use of some Mongo specific
>> API
>> to perform some important critical operation to achieve better performance
>> by checking if the underlying DocumentStore is Mongo based.
>>
>
> Right, but the Mongo specific API is a (hopefully) well thought through
> API where as with your proposal there are a lot of open questions and
> concerns as per my last mail.
>
> Mongo (and any other COTS DB) for good reasons also don't give you direct
> access to its internal file handles.
>
>
>
>> I have seen discussion of JCR-3534 and other related issue but still do
>> not
>> see any conclusion on how to answer such queries where direct access to
>> blobs is required for performance aspect. This issue is not about exposing
>> the blob reference for remote access but more about optimal path for in VM
>> access
>>
>
> One bottom line of the discussions in that issue is that we came to a
> conclusion after clarifying the specifics of the use case. Something I'm
> still missing here. The case you brought forward is too general to serve as
> a guideline for a solution. Quite to the contrary, to me it looks like a
> solution to some problem (I'm trying to understand).
>
>
>
>> who owns the resource? Who coordinates (concurrent) access to it and how?
>>>
>> What are the correctness and performance implications here (races,
>> deadlock, corruptions, JCR semantics)?
>>
>> The client code would need to be implemented in a proper way. Its more
>> like
>> implementing a CommitHook. If implemented in incorrect way it would cause
>> issues deadlocks etc. But then we assume that any one implementing that
>> interface would take proper care in implementation.
>>
>
> But a commit hook is an internal SPI. It is not advertised to the whole
> world as a public API.
>
>
>
>>  it limits implementation freedom and hinders further evolution
>>>
>> (chunking, de-duplication, content based addressing, compression, gc,
>> etc.)
>> for data stores.
>>
>> As mentioned earlier. Some part of API indicates a clo

Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-11 Thread Chetan Mehrotra
> what guarantees do/can we give re. this file handle within this context.
> Can it suddenly go away (e.g. because of gc or internal re-organisation)?
> How do we establish, test and maintain (e.g. from regressions) such
> guarantees?

Logically it should not go away suddenly. So the GC logic should be aware of
such "inUse" instances (there is already such support for inUse cases).
Such a requirement can be validated via an integration testcase

>  and more concerningly, how do we protect Oak from data corruption by
> misbehaving clients? E.g. clients writing on that handle or removing it?
> Again, if this is public API we need ways to test this.

Not sure what is meant by a misbehaving client - is it malicious (by design) or
badly written code? For the latter, yes, that might pose a problem, but we can
have some defense. I would expect the code making use of the API to behave
properly. In addition, as proposed above [1], for FileDataStore we can provide a
symlinked file reference which exposes a read only file handle. For
S3DataStore the code would need access to the AWS credentials to perform any
write operation, which should be a sufficient defense

> In an earlier mail you quite fittingly compared this to commit hooks,
> which for good reason are an internal SPI.

A bit of a nitpick here ;) As per the Jcr class [2] one can provide a CommitHook
instance, so not sure if we can term it internal. However the point that I
wanted to emphasize is that Oak does provide some critical extension points,
and with misbehaving code one can shoot oneself in the foot; as an
implementation only so much can be done.

regards
Chetan
[1]
http://markmail.org/thread/6mq4je75p64c5nyn#query:+page:1+mid:237kzuhor5y3tpli+state:results
[2]
https://github.com/apache/jackrabbit-oak/blob/trunk/oak-jcr/src/main/java/org/apache/jackrabbit/oak/jcr/Jcr.java#L190

Chetan Mehrotra


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-11 Thread Chetan Mehrotra
Hi Angela,

On Tue, May 10, 2016 at 9:49 PM, Angela Schreiber <anch...@adobe.com> wrote:

> Quite frankly I would very much appreciate if you took the time to collect
> and write down the required (i.e. currently known and expected)
> functionality.
>
> Then look at the requirements and look what is wrong with the current
> API that we can't meet those requirements:
> - is it just missing API extensions that can be added with moderate effort?
> - are there fundamental problems with the current API that we needed to
> address?
> - maybe we even have intrinsic issues with the way we think about the role
> of the repo?
>
> IMHO, sticking to kludges might look promising on a short term but
> I am convinced that we are better off with a fundamental analysis of
> the problems... after all the Binary topic comes up on a regular basis.
> That leaves me with the impression that yet another tiny extra and
> adaptables won't really address the core issues.
>

Makes sense.

Have a look at one of the initial mails in the thread [1], which talks about
the 2 usecases I know of. The image rendition usecase manifests itself in one
form or another, basically providing access for native programs via a file path
reference.

The approach proposed so far would be able to address them and hence is closer
to "is it just missing API extensions that can be added with moderate
effort?". If there is any other approach with which we can address both of the
referred usecases then we can implement that.

Let me know if more details are required. If required I can put it up on a
wiki page also.

Chetan Mehrotra
[1]
http://markmail.org/thread/6mq4je75p64c5nyn#query:+page:1+mid:zv5dzsgmoegupd7l+state:results


API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-03 Thread Chetan Mehrotra
Hi Team,

For OAK-1963 we need to allow access to the actual Blob location, say in the
form of a File instance or an S3 object id etc. This access is needed to
perform optimized IO operations around the binary object e.g.

1. The File object can be used to spool the file content with zero copy
using NIO by accessing the FileChannel directly [1]

2. Client code can efficiently replicate a binary stored in S3 by having
direct access to the S3 object, using a copy operation
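
A sketch of point 1, assuming a File handle obtained from the proposed API:

import java.io.File;
import java.io.FileInputStream;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

final class ZeroCopySpool {
    static void spool(File file, WritableByteChannel out) throws Exception {
        try (FileChannel in = new FileInputStream(file).getChannel()) {
            long pos = 0, size = in.size();
            while (pos < size) {
                // transferTo can delegate the copy to the OS (no user-space buffer)
                pos += in.transferTo(pos, size - pos, out);
            }
        }
    }
}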

To allow such access we would need a new API in the form of
AdaptableBinary.

API
===

public interface AdaptableBinary {

    /**
     * Adapts the binary to another type like File, URL etc
     *
     * @param <AdapterType> The generic type to which this binary is adapted
     *            to
     * @param type The Class object of the target type, such as
     *            File.class
     * @return The adapter target or null if the binary cannot
     *         adapt to the requested type
     */
    <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
}

Usage
=

Binary binProp = node.getProperty("jcr:data").getBinary();

//Check if Binary is of type AdaptableBinary
if (binProp instanceof AdaptableBinary) {
    AdaptableBinary adaptableBinary = (AdaptableBinary) binProp;

    //Adapt it to a File instance
    File file = adaptableBinary.adaptTo(File.class);
}



The Binary instance returned by Oak,
i.e. org.apache.jackrabbit.oak.plugins.value.BinaryImpl, would then
implement this interface, and calling code can then check the type, cast
it and adapt it

Key Points


1. Depending on backing BlobStore the binary can be adapted to various
types. For FileDataStore it can be adapted to File. For S3DataStore it can
either be adapted to URL or some S3DataStore specific type.

2. Security - Thomas suggested that for better security the ability to
adapt should be restricted based on session permissions. So only if the user
has the required permission would the adaptation work; otherwise null would
be returned.

3. Adaptation proposal is based on Sling Adaptable [2]

4. This API is for now exposed only at the JCR level. Not sure if we should do
it at the Oak level, as Blob instances are currently not bound to any session.
So the proposal is to place this in the 'org.apache.jackrabbit.oak.api' package

Kindly provide your feedback! Also any suggestion/guidance around how the
access control should be implemented

Chetan Mehrotra
[1] http://www.ibm.com/developerworks/library/j-zerocopy/
[2]
https://sling.apache.org/apidocs/sling5/org/apache/sling/api/adapter/Adaptable.html


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-05 Thread Chetan Mehrotra
On Wed, May 4, 2016 at 10:07 PM, Ian Boston <i...@tfd.co.uk> wrote:

> If the File or URL is writable, will writing to the location cause issues
> for Oak ?
>

Yes, that would cause problems. The expectation here is that code using a
direct location needs to behave responsibly.

Chetan Mehrotra


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-05 Thread Chetan Mehrotra
On Thu, May 5, 2016 at 4:38 PM, Francesco Mari <mari.france...@gmail.com>
wrote:

> The security concern is quite easy to explain: it's a bypass of our
> security model. Imagine that, using a session with the appropriate
> privileges, a user accesses a Blob and adapts it to a file handle, an S3
> bucket or a URL. This code passes this reference to another piece of code
> that modifies the data directly even if - in the same deployment - it
> shouldn't be able to access the Blob instance to begin with.
>

How is this different from the case where code obtains a Node via an
admin session and passes that Node instance to other code which, say,
deletes important content via it? In the end we have to trust the client
code to do the correct thing when given appropriate rights. So in the current
proposal the code can only adapt the binary if the session has the expected
permissions. Past that we need to trust the code to behave properly.

> In both the use cases, the customer is coupling the data with the most
> appropriate storage solution for his business case. In this case, customer
> code - and not Oak - should be responsible for the management of that
> data.

Well, then it means that the customer implements their very own DataStore-like
solution, and all the application code does not make use of JCR Binary and
instead uses another service to resolve the references. This would greatly
reduce the usefulness of JCR for asset heavy applications which use JCR to
manage binary content along with its metadata


Chetan Mehrotra


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-05 Thread Chetan Mehrotra
> This proposal introduces a huge leak of abstractions and has deep security
> implications.

I understand the leak of abstractions concern. However I would like to
understand the security concern a bit more.

One way I can think of it causing a security concern is if you have some
malicious code running in the same JVM which can then do bad things with the
file handle. Do note that the File handle would not get exposed via any
remoting API we currently support. Now in this case, if malicious code is
already running in the same JVM then security is breached anyway, and the code
can make use of reflection to access internal details.

So if there is any other possible security concern then I would like to
discuss it.

Coming to usecases

Usecase A - Image rendition generation
-

We have some bigger deployments where lots of images get uploaded to the
repository and there are some conversions (rendition generation) which are
performed by OS specific native executables. Such programs work directly on
file handles. Without this change we currently need to first spool the file
content into some temporary location and then pass that to the other
program. This adds unnecessary overhead, something which can be avoided
when a FileDataStore is being used, where we can provide direct
access to the file

Usecase B - Efficient replication across regions in S3
--

This is for an AEM based setup which is running on Oak with S3DataStore. There
we have a global deployment where the author instance is running in one region
and binary content is to be distributed to publish instances running in
different regions. The DataStore size is huge, say 100TB, and for efficient
operation we need to use binary-less replication. In most cases only a very
small subset of the binary content would need to be present in other regions.
The current way (via a shared DataStore) to support that would involve
synchronizing the S3 bucket across all such regions, which would increase the
storage cost considerably.

Instead of that, the plan is to replicate the specific assets via an S3 copy
operation. This would ensure that big assets can be copied efficiently at the
S3 level, and that would require direct access to the S3 object.
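
A sketch of such a server-side copy (bucket names are made up; the call shown
is from the AWS SDK for Java v1):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3RegionCopy {
    public static void copy(String srcBucket, String key, String dstBucket) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        // the object is copied within S3, without streaming it through Oak
        s3.copyObject(srcBucket, key, dstBucket, key);
    }
}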

Again, in all such cases one can always resort to the current level of support,
i.e. copy over all the content via an InputStream into some temporary store and
then use that. But that would add considerable overhead when assets are of
100MB size or more. So the approach proposed would allow client code to do
this efficiently depending on the underlying storage capability

> To me sounds like breaching the JCR and NodeState layers to directly
> manipulate NodeStore binaries (from the DataStore), e.g. to perform smart
> replication across different instances, but imho the right way to address
> that is extending one of the current DataStore implementations or create a
> new one.

The originally proposed approach in OAK-1963 was like that, i.e. introduce
this access method on BlobStore, working on a reference. But in that case
client code would need to deal with the BlobStore API. In either case access to
the actual binary storage data would be required

Chetan Mehrotra

On Thu, May 5, 2016 at 2:49 PM, Tommaso Teofili <tommaso.teof...@gmail.com>
wrote:

> +1 to Francesco's concerns, exposing the location of a binary at the
> application level doesn't sound good from a security perspective.
> To me sounds like breaching the JCR and NodeState layers to directly
> manipulate NodeStore binaries (from the DataStore), e.g. to perform smart
> replication across different instances, but imho the right way to address
> that is extending one of the current DataStore implementations or create a
> new one.
> I am also concerned that this Adaptable pattern would open room for other
> such hacks into the stack.
>
> My 2 cents,
> Tommaso
>
>
> Il giorno gio 5 mag 2016 alle ore 11:00 Francesco Mari <
> mari.france...@gmail.com> ha scritto:
>
> > This proposal introduces a huge leak of abstractions and has deep
> security
> > implications.
> >
> > I guess that the reason for this proposal is that some users of Oak would
> > like to perform some operations on binaries in a more performant way by
> > leveraging the way those binaries are stored. If this is the case, I
> > suggest those users to evaluate an applicative solution implemented on
> top
> > of the JCR API.
> >
> > If a user needs to store some important binary data (files, images, etc.)
> > in an S3 bucket or on the file system for performance reasons, this
> > shouldn't affect how Oak handles blobs internally. If some assets are of
> > special interest for the user, then the user should bypass Oak and take
> > care of the storage of those assets directly. Oak can 

Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-05 Thread Chetan Mehrotra
On Thu, May 5, 2016 at 5:07 PM, Francesco Mari <mari.france...@gmail.com>
wrote:

>
> This is a totally different thing. The change to the node will be committed
> with the privileges of the session that retrieved the node. If the session
> doesn't have enough privileges to delete that node, the node will not be
> deleted. There is no escape from the security model.


A "bad code" when passes a node backed via admin session can still do bad
thing as admin session has all the privileges. In same way if a bad code is
passed a file handle then it can cause issue. So I am still not sure on the
attack vector which we are defending against.

Chetan Mehrotra


Re: API proposal for - Expose URL for Blob source (OAK-1963)

2016-05-09 Thread Chetan Mehrotra
Some more points around the proposed callback based approach

1. Possible security, or enforcing read only access to the exposed file -
the file provided within the BlobProcessor callback can be a symlink
created with an OS user account which only has read only access. The symlink
can be removed once the callback returns

2. S3DataStore security concern - for S3DataStore we would only be
exposing the S3 object identifier, and the client code would still need the
AWS credentials to connect to the bucket and perform the required copy
operation

3. Possibility of further optimization in S3DataStore processing -
currently when reading a binary from S3DataStore the binary content is
*always* spooled to some local temporary file (in the local cache) and then an
InputStream is opened on that file. So even if the code needs to read only the
initial few bytes of the stream, the whole file would have to be read. This
happens because with the current JCR Binary API we are not in control of the
lifetime of the exposed InputStream. So if, say, we expose the InputStream we
cannot determine until when the backing S3 SDK resources need to be held.

Also, the current S3DataStore always creates a local copy - with a callback
based approach we can safely expose this file, which would allow layers above
to avoid spooling the content again locally for processing. And with the
callback boundary we can later do the required cleanup


Chetan Mehrotra

On Mon, May 9, 2016 at 7:15 PM, Chetan Mehrotra <chetan.mehro...@gmail.com>
wrote:

> Had an offline discussion with Michael on this and explained the usecase
> requirement in more details. One concern that has been raised is that such
> a generic adaptTo API is too inviting for improper use and Oak does not
> have any context around when this url is exposed for what time it is used.
>
> So instead of having a generic adaptTo API at JCR level we can have a
> BlobProcessor callback (Approach #B). Below is more of a strawman proposal.
> Once we have a consensus then we can go over the details
>
> interface BlobProcessor {
>void process(AdaptableBlob blob);
> }
>
> Where AdaptableBlob is
>
> public interface AdaptableBlob {
>     <AdapterType> AdapterType adaptTo(Class<AdapterType> type);
> }
>
> The BlobProcessor instance can be passed via BlobStore API. So client
> would look for a BlobStore service (so use the Oak level API) and pass it
> the ContentIdentity of JCR Binary aka blobId
>
> interface BlobStore {
>     void process(String blobId, BlobProcessor processor);
> }
>
> The approach ensures
>
> 1. That any blob handle exposed is only guaranteed for the duration
> of  'process' invocation
> 2. There is no guarantee on the utility of blob handle (File, S3 Object)
> beyond the callback. So one should not collect the passed File handle for
> later use
>
> Hopefully this should address some of the concerns raised in this thread.
> Looking forward to feedback :)
>
> Chetan Mehrotra
>
> On Mon, May 9, 2016 at 6:24 PM, Michael Dürig <mdue...@apache.org> wrote:
>
>>
>>
>> On 9.5.16 11:43 , Chetan Mehrotra wrote:
>>
>>> To highlight - As mentioned earlier the user of proposed api is tying
>>> itself to implementation details of Oak and if this changes later then
>>> that
>>> code would also need to be changed. Or as Ian summed it up
>>>
>>> if the API is introduced it should create an out of band agreement with
>>>>
>>> the consumers of the API to act responsibly.
>>>
>>
>> So what does "to act responsibly" actually means? Are we even in a
>> position to precisely specify this? Experience tells me that we only find
>> out about those semantics after the fact when dealing with painful and
>> expensive customer escalations.
>>
>> And even if we could, it would tie Oak into very tight constraints on how
>> it has to behave and how not. Constraints that would turn out prohibitively
>> expensive for future evolution. Furthermore a huge amount of resources
>> would be required to formalise such constraints via test coverage to guard
>> against regressions.
>>
>>
>>
>>> The method is to be used for those important case where you do rely on
>>> implementation detail to get optimal performance in very specific
>>> scenarios. Its like DocumentNodeStore making use of some Mongo specific
>>> API
>>> to perform some important critical operation to achieve better
>>> performance
>>> by checking if the underlying DocumentStore is Mongo based.
>>>
>>
>> Right, but the Mongo specific API is a (hopefully) well thought through
>> API where as with your proposal there are a lot of open questions and
>> concerns as per my last mail.
>

Re: [VOTE] Release Apache Jackrabbit Oak 1.2.14

2016-04-19 Thread Chetan Mehrotra
On Wed, Apr 20, 2016 at 10:25 AM, Amit Jain <am...@apache.org> wrote:

>   [ ] +1 Release this package as Apache Jackrabbit Oak 1.2.14


All checks ok

Chetan Mehrotra


Re: Way to capture metadata related to commit as part of CommitInfo from within CommitHook

2016-08-03 Thread Chetan Mehrotra
On Wed, Aug 3, 2016 at 8:57 PM, Michael Dürig <mdue...@apache.org> wrote:
> I would suggest to add a new, internal mechanism to CommitInfo for your
> purpose.

So introduce a new CommitAttributes instance which would be returned
by CommitInfo ... ?

Chetan Mehrotra


Re: Using same index definition for both async and sync indexing

2016-08-03 Thread Chetan Mehrotra
On Wed, Aug 3, 2016 at 7:52 PM, Alex Parvulescu
<alex.parvule...@gmail.com> wrote:
> sounds interesting, this looks like a good option.
>

Now comes the hard part ... what should be the name of this new
interface ;) ContextualIndexEditorProvider?

Chetan Mehrotra


Re: Way to capture metadata related to commit as part of CommitInfo from within CommitHook

2016-08-03 Thread Chetan Mehrotra
That would depend on the CommitHook impl, which client code would not
be aware of. And the commit hook itself would only know as the commit
traversal is done. So it needs to be some mutable state
Chetan Mehrotra


On Wed, Aug 3, 2016 at 8:27 PM, Michael Dürig <mdue...@apache.org> wrote:
>
> Couldn't we keep the map immutable and instead add some "WhateverCollector"
> instances as values? E.g. add a AffectedNodeTypeCollector right from the
> beginning?
>
> Michael
>
>
>
> On 3.8.16 4:06 , Chetan Mehrotra wrote:
>>
>> So would it be ok to make the map within CommitInfo mutable ?
>> Chetan Mehrotra
>>
>>
>> On Wed, Aug 3, 2016 at 7:29 PM, Michael Dürig <mdue...@apache.org> wrote:
>>>
>>>
>>>>
>>>> #A -Probably we can introduce a new type CommitAttributes which can be
>>>> attached to CommitInfo and which can be modified by the CommitHooks.
>>>> The CommitAttributes can then later be accessed by Observer
>>>
>>>
>>>
>>> This is already present via the CommitInfo.info map. It is even used in a
>>> similar way. See CommitInfo.getPath() and its usages. AFAIU the only part
>>> where your cases would differ is that the information is assembled by
>>> some
>>> commit hooks instead of being provided at the point the commit was
>>> initiated.
>>>
>>>
>>> Michael


Re: Way to capture metadata related to commit as part of CommitInfo from within CommitHook

2016-08-03 Thread Chetan Mehrotra
Opened OAK-4640 to track this
Chetan Mehrotra


On Wed, Aug 3, 2016 at 9:36 PM, Michael Dürig <mdue...@apache.org> wrote:
>
>
> On 3.8.16 5:58 , Chetan Mehrotra wrote:
>>
>> On Wed, Aug 3, 2016 at 8:57 PM, Michael Dürig <mdue...@apache.org> wrote:
>>>
>>> I would suggest to add a new, internal mechanism to CommitInfo for your
>>> purpose.
>>
>>
>> So introduce a new CommitAttributes instance which would be returned
>> by CommitInfo ... ?
>
>
> Probably the best of all ugly solutions yes ;-) (Meaning I don't have a
> better idea neither...)
>
> Michael
>
>>
>> Chetan Mehrotra
>>
>


Re: Using same index definition for both async and sync indexing

2016-08-03 Thread Chetan Mehrotra
Opened OAK-4641 for this enhancement
Chetan Mehrotra


On Wed, Aug 3, 2016 at 8:00 PM, Chetan Mehrotra
<chetan.mehro...@gmail.com> wrote:
> On Wed, Aug 3, 2016 at 7:52 PM, Alex Parvulescu
> <alex.parvule...@gmail.com> wrote:
>> sounds interesting, this looks like a good option.
>>
>
> Now comes the hard part ... what should be the name of this new
> interface ;) ContextualIndexEditorProvider?
>
> Chetan Mehrotra


Re: Provide a way to pass indexing related state to IndexEditorProvider (OAK-4642)

2016-08-04 Thread Chetan Mehrotra
I have updated OAK-4642 with one more option.

===
O4 - Similar to O2, but here instead of modifying the existing
IndexUpdateCallback we can introduce a new interface
ContextualCallback which extends IndexUpdateCallback and provides
access to the IndexingContext. Editor provider implementations can then
check if the callback implements this new interface, cast it
and access the context. So only those clients which are interested in the
new capability make use of it
===
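
A sketch of O4 using the proposal's assumed names (ContextualCallback and
IndexingContext are defined locally here; they are not existing Oak API):

interface IndexingContext {
    String getIndexPath();
    boolean isReindexing();
}

interface IndexUpdateCallback {
    void indexUpdate() throws Exception;
}

interface ContextualCallback extends IndexUpdateCallback {
    IndexingContext getIndexingContext();
}

class ExampleEditorProvider {
    void configure(IndexUpdateCallback callback) {
        // only providers interested in the context need the new interface
        if (callback instanceof ContextualCallback) {
            IndexingContext ctx = ((ContextualCallback) callback).getIndexingContext();
            System.out.println("indexing " + ctx.getIndexPath()
                    + (ctx.isReindexing() ? " (reindex)" : ""));
        }
        // otherwise proceed without the extra context
    }
}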

So provide your feedback there or in this thread
Chetan Mehrotra


On Thu, Aug 4, 2016 at 12:35 PM, Chetan Mehrotra
<chetan.mehro...@gmail.com> wrote:
> Hi Team,
>
> As a follow up to previous mail around "Using same index definition
> for both async and sync indexing" wanted to discuss the next step. We
> need to provide a way to pass indexing related state to
> IndexEditorProvider (OAK-4642)
>
> Over the period of time I have seen need for extra state like
>
> 1. reindexing - Currently the index implementations use a heuristic
> like checking whether the before root state is empty to determine if they are
> running in reindexing mode
> 2. indexing mode - sync or async
> 3. index path of the index (see OAK-4152)
> 4. CommitInfo (see OAK-4640)
>
> For #1 and #3 we have done some kind of workaround but it would be
> better to have a first class support for that.
>
> So we would need to introduce some sort of IndexingContext and have
> the api for IndexEditorProvider like below
>
> =
> @CheckForNull
> Editor getIndexEditor(
>         @Nonnull String type, @Nonnull NodeBuilder definition,
>         @Nonnull NodeState root,
>         @Nonnull IndexingContext context) throws CommitFailedException;
> =
>
> To introduce such a change I see 3 options
>
> * O1 - Introduce a new interface which takes an {{IndexingContext}}
> instance which provides access to such datapoints. This would require
> some broader change
> ** Wherever the IndexEditorProvider is invoked it would need to check
> if the instance implements the new interface. If yes then the new method
> needs to be used
>
> Overall it introduces noise.
>
> * O2 - Here we can introduce such data points as part of callback
> interface. With this we would need to implement such methods in places
> where code constructs the callback
>
> * O3 - Make a backward incompatible change and just modify the
> existing interface and adapt the various implementation
>
> I am in favour of going for O3 and making this backward incompatible change
>
> Thoughts?
>
> Chetan Mehrotra


Provide a way to pass indexing related state to IndexEditorProvider (OAK-4642)

2016-08-04 Thread Chetan Mehrotra
Hi Team,

As a follow up to previous mail around "Using same index definition
for both async and sync indexing" wanted to discuss the next step. We
need to provide a way to pass indexing related state to
IndexEditorProvider (OAK-4642)

Over the period of time I have seen need for extra state like

1. reindexing - Currently the index implementations use a heuristic
like checking whether the before root state is empty to determine if they are
running in reindexing mode
2. indexing mode - sync or async
3. index path of the index (see OAK-4152)
4. CommitInfo (see OAK-4640)

For #1 and #3 we have done some kind of workaround but it would be
better to have a first class support for that.

So we would need to introduce some sort of IndexingContext and have
the api for IndexEditorProvider like below

=
@CheckForNull
Editor getIndexEditor(
        @Nonnull String type, @Nonnull NodeBuilder definition,
        @Nonnull NodeState root,
        @Nonnull IndexingContext context) throws CommitFailedException;
=

To introduce such a change I see 3 options

* O1 - Introduce a new interface which takes an {{IndexingContext}}
instance which provides access to such datapoints. This would require
some broader change
** Wherever the IndexEditorProvider is invoked it would need to check
if the instance implements the new interface. If yes then the new method
needs to be used

Overall it introduces noise.

* O2 - Here we can introduce such data points as part of callback
interface. With this we would need to implement such methods in places
where code constructs the callback

* O3 - Make a backward incompatible change and just modify the
existing interface and adapt the various implementation

I am in favour of going for O3 and making this backward incompatible change

Thoughts?

Chetan Mehrotra


Re: Oak Indexing. Was Re: Property index replacement / evolution

2016-08-11 Thread Chetan Mehrotra
Couple of points around the motivation, target usecase around Hybrid
Indexing and Oak indexing in general.

Based on my understanding of various deployments, any application
based on Oak has 2 types of query requirements

QR1. Application Query - These mostly involve some property
restrictions and are invoked by code itself to perform some operation.
The properties involved here would in most cases be sparse, i.e. present
in a small subset of the whole repository content. Such queries need to be
very fast and they might be invoked very frequently. Such queries
should also be more accurate, and the results should not lag the
repository state much.

QR2. User provided query - These queries would consist of both or
either of property restrictions and fulltext constraints. The target
nodes may form a majority of the overall repository content. Such
queries need to be fast but, being user driven, need not be very fast.

Note that the speed criteria are very subjective and relative here.
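
For illustration only (hypothetical queries, not from an actual application):

QR1: SELECT * FROM [app:Task] WHERE [status] = 'pending'
     (sparse property restriction, issued by application code)
QR2: //element(*, nt:file)[jcr:contains(., 'invoice')]
     (user provided fulltext search over a large part of the content)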

Further Oak needs to support deployments

1. On single setup - For dev, prod on SegmentNodeStore
2. Cluster Setup on premise
3. Deployment in some DataCenter

So Oak should enable deployments where a smaller setup does not
require any thirdparty system while still allowing plugging in a dedicated
system like ES/Solr if the need arises. So both usecases need to be
supported.

And further, even if it has access to such a third party server, it might
be fine to rely on embedded Lucene for #QR1 and just delegate queries
under #QR2 to the remote one. This would ensure that query results are
still fast for usage falling under #QR1.

Hybrid Index Usecase
-

So far for #QR1 we only had property indexes and, to an extent, Lucene
based property indexes where results lag the repository state and the lag
might be significant depending on load.

Hybrid indexes aim to support queries under #QR1 and can be seen as a
replacement for existing non unique property indexes. Such indexes
would have lower storage requirements and would not put much load on
remote storage for execution. It's not meant as a replacement for
ES/Solr but instead intends to address a different type of usage.

Very large Indexes
-

For deployments having a very large repository, Solr or ES based indexes
would be preferable and there oak-solr can be used (some day oak-es!)

So in brief, Oak should be self sufficient for smaller deployments and
still allow plugging in Solr/ES for large deployments, and there also
provide a choice to the admin to configure a subset of indexes for such
usage depending on the size.






Chetan Mehrotra


On Thu, Aug 11, 2016 at 1:59 PM, Ian Boston <i...@tfd.co.uk> wrote:
> Hi,
>
> On 11 August 2016 at 09:14, Michael Marth <mma...@adobe.com> wrote:
>
>> Hi Ian,
>>
>> No worries - good discussion.
>>
>> I should point out though that my reply to Davide was based on a
>> comparison of the current design vs the Jackrabbit 2 design (in which
>> indexes were stored locally). Maybe I misunderstood Davide’s comment.
>>
>> I will split my answer to your mail in 2 parts:
>>
>>
>> >
>> >Full text extraction should be separated from indexing, as the DS blobs
>> are
>> >immutable, so is the full text. There is code to do this in the Oak
>> >indexer, but it's not used to write to the DS at present. It should be
>> done
>> >in a Job, distributed to all nodes, run only once per item. Full text
>> >extraction is hugely expensive.
>>
>> My understanding is that Oak currently:
>> A) runs full text extraction in a separate thread (separate form the
>> “other” indexer)
>> B) runs it only once per cluster
>> If that is correct then the difference to what you mention above would be
>> that you would like the FT indexing not be pinned to one instance but
>> rather be distributed, say round-robin.
>> Right?
>>
>
>
> Yes.
>
>
>>
>>
>> >Building the same index on every node doesn't scale for the reasons you
>> >point out, and eventually hits a brick wall.
>> >http://lucene.apache.org/core/6_1_0/core/org/apache/
>> lucene/codecs/lucene60/package-summary.html#Limitations.
>> >(Int32 on Document ID per index). One of the reasons for the Hybrid
>> >approach was the number of Oak documents in some repositories will exceed
>> >that limit.
>>
>> I am not sure what you are arguing for with this comment…
>> It sounds like an argument in favour of the current design - which is
>> probably not what you mean… Could you explain, please?
>>
>
> I didn't communicate that very well.
>
> Currently Lucene (6.1) has a limit of Int32 on the number of documents it
> can store in an index. IIUC there is a long term desire to increase that
> by using Int64, but no long term commitment as it's probably significant
> work given arrays in Java are indexed with Int32.
>
> The Hybrid approach doesn't help the potential Lucene brick wall, but one
> motivation for looking at it was the number of Oak Documents including
> those under /oak:index which is, in some cases, approaching that limit.
>
>
>
>>
>>
>> Thanks!
>> Michael
>>


Re: Oak Indexing. Was Re: Property index replacement / evolution

2016-08-11 Thread Chetan Mehrotra
On Thu, Aug 11, 2016 at 3:03 PM, Ian Boston <i...@tfd.co.uk> wrote:
> Both Solr Cloud and ES address this by sharding and
> replicating the indexes, so that all commits are soft, instant and real
> time. That introduces problems.
...
> Both Solr Cloud and ES address this by sharding and
> replicating the indexes, so that all commits are soft, instant and real
> time.

This would really be useful. However I have a couple of aspects to clarify.

Index Update Guarantee


Let's say a commit succeeds and then we update the index, and the index
update fails for some reason. Would that update be missed, or can there
be some mechanism to recover? I am not very sure, but the WAL may be
the answer here; still confirming.

In Oak, with the way async index update works based on checkpoints, it's
ensured that the index would "eventually" contain the right data and no
update would be lost. If there is a failure in an index update then that
cycle would fail and the next cycle would start again from the same base state.

Order of index update
-

Let's say I have 2 cluster nodes where the same node is being modified

Original state /a {x:1}

Cluster Node N1 - /a {x:1, y:2}
Cluster Node N2 - /a {x:1, z:3}

End State /a {x:1, y:2, z:3}

At the Oak level both the commits would succeed as there is no conflict.
However N1 and N2 would not see each other's updates immediately, as
that depends on the background read. So in this case, what would the
index update look like?

1. Would index updates for specific paths go to some master which would
order the updates
2. Or would it end up with either of {x:1, y:2} or {x:1, z:3}

Here the current async index update logic ensures that it sees the
eventually expected order of changes and hence would be consistent
with the repository state.

Backup and Restore
---

Would the backup now involve a backup of ES index files from each
cluster node? Or, assuming full replication, would it involve a backup of
files from any one of the nodes? Would the backup be in sync with the last
changes done in the repository (assuming a sudden shutdown where changes
got committed to the repository but not yet to any index)?

Here the current approach of storing index files as part of the MVCC
storage ensures that the index state is consistent with some "checkpointed"
state in the repository. And post restart it would eventually catch up with
the current repository state and hence would not require a complete rebuild
of the index in case of unclean shutdowns.


Chetan Mehrotra


Re: svn commit: r1752601 - in /jackrabbit/oak/trunk/oak-segment-tar: pom.xml src/main/java/org/apache/jackrabbit/oak/segment/SegmentWriter.java

2016-07-14 Thread Chetan Mehrotra
On Thu, Jul 14, 2016 at 2:04 PM,  <f...@apache.org> wrote:
>
> +commons-math3

commons-math is a 2.1 MB jar. Would it be possible to avoid embedding
it whole and to only have some parts embedded/copied? (See [1] for an
example.)

Chetan Mehrotra
[1] https://issues.apache.org/jira/browse/SLING-2361


Re: Specifying threadpool name for periodic scheduled jobs (OAK-4563)

2016-07-19 Thread Chetan Mehrotra
On Tue, Jul 19, 2016 at 12:54 PM, Michael Dürig <mdue...@apache.org> wrote:
> For blocking or time intensive tasks I would go for a dedicated thread pool.

So wrt current issue that means option #B ?

Chetan Mehrotra


Re: Specifying threadpool name for periodic scheduled jobs (OAK-4563)

2016-07-19 Thread Chetan Mehrotra
On Tue, Jul 19, 2016 at 1:44 PM, Stefan Egli <stefane...@apache.org> wrote:
> I'd go for #A to limit cross-effects between oak and other layers.

Note that for #4 there can be multiple tasks scheduled. So if a system
has 100 JCR Listeners then there would be 1 task/listener to manage
the time series stats. These should be quick and non blocking though.

All other tasks are much more critical for the repository to function
properly. Hence the thought to go for #B where we have a dedicated pool
for those 'n' tasks, where n is much smaller, i.e. the number of async
lanes + 2 from DocumentNodeStore so far. So it's easy to size.

Chetan Mehrotra


Re: Specifying threadpool name for periodic scheduled jobs (OAK-4563)

2016-07-19 Thread Chetan Mehrotra
On Tue, Jul 19, 2016 at 1:21 PM, Michael Dürig <mdue...@apache.org> wrote:
> Not sure as I'm confused by your description of that option. I don't
> understand which of 1, 2, 3 and 4 would run in the "default pool" and which
> should run in its own dedicated pool.

#1, #2 and #3 would run in dedicated pool and each using same pool.
Pool name would be 'oak'. Also see OAK-4563 for the patch

While for #4 default pool would be used as those are non blocking and
short tasks

Chetan Mehrotra


Specifying threadpool name for periodic scheduled jobs (OAK-4563)

2016-07-18 Thread Chetan Mehrotra
Hi Team,

While running Oak in Sling we rely on the Sling Scheduler [1] to execute
periodic jobs. By default the Sling Scheduler uses a pool of 5 threads
to run all such periodic jobs in the system. Recently we saw an issue,
OAK-4563, where for some reason the pool got exhausted for a long time;
that prevented the async indexing job from running for a long time and
hence affected the query results.

To address that Sling now provides a new option (SLING-5831) where one
can specify the pool name to be used to execute a specific job. So we
can specify a custom pool which can be used for Oak related jobs.

Now currently in Oak we use following types of periodic jobs

1. Async indexing - (Cluster Singleton)
2. Document Store - Journal GC (Cluster Singleton)
3. Document Store - LastRevRecovery
4. Statistic Collection - For timeseries data update in ChangeProcessor,
SegmentNodeStore GCMonitor

Now should we use

A - one single pool for all of the above
B - the dedicated pool only for 1-3. The default pool would still be of
size 5, so even if #2 and #3 are running they would not hamper #1

Assuming #4 is not that critical to run and may consist of lots of jobs.

My suggestion would be to go for #B
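
For illustration, with a whiteboard based registration this could look
something like below (property names as per SLING-5831 and the Sling
scheduler docs; the 'oak' pool name is just an example):

Map<String, Object> props = new HashMap<String, Object>();
props.put("scheduler.period", 5L);         // run every 5 seconds
props.put("scheduler.concurrent", false);  // no overlapping runs
props.put("scheduler.threadPool", "oak");  // use the dedicated 'oak' pool
whiteboard.register(Runnable.class, asyncIndexUpdate, props);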

Chetan Mehrotra
[1] 
https://sling.apache.org/documentation/bundles/scheduler-service-commons-scheduler.html


Re: Why is nt:resource referencable?

2016-07-20 Thread Chetan Mehrotra
On Wed, Jul 20, 2016 at 2:49 PM, Bertrand Delacretaz
<bdelacre...@apache.org> wrote:
> but the JCR spec (JSR 283 10 August 2009) only has
>
>   [nt:resource] > mix:mimeType, mix:lastModified
> primaryitem jcr:data
> - jcr:data (BINARY) mandatory

That's interesting. I did not know it's not mandated in JCR 2.0. However
it looks like for backward compatibility we need to support it. See [1]
where this was changed.

@Marcel - I did not understand JCR-2170 properly. But is there any chance
we can switch to a newer version of nt:resource, not modify existing
nodes, and let the new definition be effective/enforced only on new nodes?

Chetan Mehrotra
[1] 
https://issues.apache.org/jira/browse/JCR-2170?focusedCommentId=12754941=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12754941


Re: Why is nt:resource referencable?

2016-07-20 Thread Chetan Mehrotra
On Wed, Jul 20, 2016 at 4:04 PM, Marcel Reutegger <mreut...@adobe.com> wrote:
> Maybe we would keep the jcr:uuid property on the referenceable node and add
> the mixin?

What if we do not add any mixin and just have the jcr:uuid property
present? The node would anyway be indexed so search would still work.
Not sure if API semantics require that nodes looked up by UUID have to
be referenceable.

For now I think oak:Resource is the safest way. But just exploring other
options if possible!


Chetan Mehrotra


Re: multilingual content and indexing

2016-07-12 Thread Chetan Mehrotra
On Tue, Jul 12, 2016 at 3:53 PM, Lukas Kahwe Smith <sm...@pooteeweet.org> wrote:
>> Alternatively, you can create different index definitions for each subtree 
>> (see [1]), e.g. Using the “includedPaths” property. This would lead to 
>> smaller indexes at the downside that you would have to create an index 
>> definition if you add a new language tree.

Another way would be to have your index definition under each node

/content/en/oak:index/fooIndex
/content/jp/oak:index/fooIndex

And have each index config analyzer configured as per the language.
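
For example, a per language definition could look something like this
(sketch; the analyzer class is illustrative):

/content/en/oak:index/fooIndex
  - jcr:primaryType = "oak:QueryIndexDefinition"
  - type = "lucene"
  - async = "async"
  + analyzers
    + default
      - class = "org.apache.lucene.analysis.en.EnglishAnalyzer"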

Chetan Mehrotra


Re: [proposal] New oak:Resource nodetype as alternative to nt:resource

2016-07-18 Thread Chetan Mehrotra
Thanks for the feedback. Opened OAK-4567 to track the change


On Mon, Jul 18, 2016 at 12:14 PM, Angela Schreiber <anch...@adobe.com> wrote:
> Additionally or alternatively we could create a separate method (e.g.
> putOakFile
> or putOakResource or something explicitly mentioning the non-referenceable
> nature of the content) that uses 'oak:Resource' and state that it requires
> the
> node type to be registered and will fail otherwise... that would be as easy
> to use as 'putFile', which is IMO important.

@Angela - What about Justin's suggestion later around changing the
current putFile implementation: have it use oak:Resource if present,
otherwise fall back to nt:resource. This can lead to a compatibility
issue though, as the javadoc of putFile says it would use nt:resource.

Chetan Mehrotra


[proposal] New oak:Resource nodetype as alternative to nt:resource

2016-07-15 Thread Chetan Mehrotra
In most cases where code uses JcrUtils.putFile [1] it leads to the
creation of the content structure below:

+ foo.jpg (nt:file)
  + jcr:content (nt:resource)
    - jcr:data

Due to the usage of nt:resource each nt:file node creates an entry in the
uuid index, as nt:resource is referenceable [2]. So if a system has 1M
nt:file nodes then we would have 1M entries in /oak:index/uuid as in
most cases the files are created via [1] and hence all such files are
referenceable.

The nodetype definition for nt:file [3] does not mandate that
jcr:content be nt:resource.

So should we register a new oak:Resource nodetype which is the same as
nt:resource but not referenceable? This would be similar to
oak:Unstructured.
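
A sketch of the definition, assuming it simply mirrors nt:resource minus
mix:referenceable:

[oak:Resource] > mix:lastModified, mix:mimeType
  primaryitem jcr:data
  - jcr:data (binary) mandatory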

Also what should we do for [1]? Should we provide an overloaded method
which also accepts a nodetype for the jcr:content node, as it cannot use
oak:Resource?

Chetan Mehrotra
[1] 
https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-jcr-commons/src/main/java/org/apache/jackrabbit/commons/JcrUtils.java#L1062

[2]
[nt:resource] > mix:lastModified, mix:mimeType, mix:referenceable
  primaryitem jcr:data
  - jcr:data (binary) mandatory

[3]

[nt:file] > nt:hierarchyNode
  primaryitem jcr:content
  + jcr:content (nt:base) mandatory


[multiplex] - Review the proposed SPI interface MountInfoProvider and Mount for OAK-3404

2016-06-28 Thread Chetan Mehrotra
Hi Team,

As we start on integrating the work done related to multiplexing
support to trunk I would like your thoughts on new SPI interface
MountInfoProvider [1] being proposed as part of OAK-3404.

This would be used by various part of Oak to determine the Mount information.

Kindly provide your feedback on the issue.

Chetan Mehrotra
[1] 
https://github.com/rombert/jackrabbit-oak/tree/features/docstore-multiplex/oak-core/src/main/java/org/apache/jackrabbit/oak/spi/mount


Re: [Oak origin/1.4] Apache Jackrabbit Oak matrix - Build # 992 - Still Failing

2016-06-27 Thread Chetan Mehrotra
On Sat, Jun 25, 2016 at 10:24 AM, Apache Jenkins Server
<jenk...@builds.apache.org> wrote:
> Caused by: java.lang.IllegalArgumentException: No enum constant 
> org.apache.jackrabbit.oak.commons.FixturesHelper.Fixture.SEGMENT_TAR
> at java.lang.Enum.valueOf(Enum.java:238)
> at 
> org.apache.jackrabbit.oak.commons.FixturesHelper$Fixture.valueOf(FixturesHelper.java:45)
> at 
> org.apache.jackrabbit.oak.commons.FixturesHelper.(FixturesHelper.java:58)

The tests are failing due to the above issue. Is this related to the
presence of the new segment-tar module in trunk but not in the branch?

Chetan Mehrotra


Re: Way to capture metadata related to commit as part of CommitInfo from within CommitHook

2016-08-03 Thread Chetan Mehrotra
So would it be ok to make the map within CommitInfo mutable?
Chetan Mehrotra


On Wed, Aug 3, 2016 at 7:29 PM, Michael Dürig <mdue...@apache.org> wrote:
>
>>
>> #A -Probably we can introduce a new type CommitAttributes which can be
>> attached to CommitInfo and which can be modified by the CommitHooks.
>> The CommitAttributes can then later be accessed by Observer
>
>
> This is already present via the CommitInfo.info map. It is even used in a
> similar way. See CommitInfo.getPath() and its usages. AFAIU the only part
> where your cases would differ is that the information is assembled by some
> commit hooks instead of being provided at the point the commit was
> initiated.
>
>
> Michael


OAK-4475 - CI failing on branches due to unknown fixture SEGMENT_TAR

2016-06-29 Thread Chetan Mehrotra
Hi Team,

Some time back the build was failing for branches because of the trunk-only
fixture usage of SEGMENT_TAR. As this fixture was not present on the
branch it caused the build to fail.

My initial attempt to fix this was to ignore the exception when
FixturesHelper resolves an enum like SEGMENT_TAR on a branch [1]. With this
the build comes out fine, but I have a hunch that the current fix would
lead to all fixtures getting activated and that would waste time.

A - Which solution to use


So we have 2 options

1. Treat SEGMENT_TAR as SEGMENT_MK for branches - This would cause
tests to run twice against SEGMENT_MK

2. Create a separate build profile for branches

B - Use of nsfixtures system property
==

However before doing that I am trying to understand how the fixtures
get set. From the CI logs the command that gets fired is

---
/home/jenkins/tools/maven/apache-maven-3.2.1/bin/mvn
-Dnsfixtures=DOCUMENT_NS -Dlabel=Ubuntu -Djdk=jdk1.8.0_11
-Dprofile=integrationTesting clean verify -PintegrationTesting
-Dsurefire.skip.ut=true -Prdb-derby -DREMOVEMErdb.jdbc-
---

It sets the system property 'nsfixtures' to the required fixture. However
in our parent pom we rely on the system property 'fixtures' which defaults
to SEGMENT_MK, and nowhere do we override 'fixtures' in our CI. Looking
at all this it appears to me that currently all tests are only running
against the SEGMENT_MK fixture and other fixtures are not getting
used. But then the exception should not have come with the usage of
SEGMENT_TAR. So I am missing some connection here in the build process.

From my test it appears that if we specify a system property on the mvn
command line and the same property is configured in maven-surefire-plugin,
then the property specified on the command line is used and the one in
pom.xml is ignored. That would explain why the settings in pom.xml are not
used for the fixtures.

So what should we opt for in #A?

My vote would be for A1!

Chetan Mehrotra

[1] 
https://github.com/apache/jackrabbit-oak/commit/319433e9400429592065d4b3997dd31f93b6c549
[2] https://github.com/apache/jackrabbit-oak/blob/trunk/oak-parent/pom.xml#L289


<plugin>
  <artifactId>maven-failsafe-plugin</artifactId>
  <configuration>
    <argLine>${test.opts}</argLine>
    <systemPropertyVariables>
      <known.issues>${known.issues}</known.issues>
      <mongo.host>${mongo.host}</mongo.host>
      <mongo.port>${mongo.port}</mongo.port>
      <mongo.db>${mongo.db}</mongo.db>
      <mongo.db2>${mongo.db2}</mongo.db2>
      <fixtures>${fixtures}</fixtures>
      <derby.stream.error.file>${project.build.directory}/derby.log</derby.stream.error.file>
    </systemPropertyVariables>
  </configuration>
</plugin>

Re: normalising the rdb database schema

2016-08-16 Thread Chetan Mehrotra
Hi Tomek,

I like the idea of revisiting our current schema based on usage so
far. However, a couple of points around potential issues with such a
normalized approach:

- This approach would lead to a thin and long table. As noted in
[1], in a small repo with ~14M nodes we have ~26M properties. With
multiple revisions (GC takes some time) this can go higher. This would
then increase the memory requirement for the id index. Memory consumption
increases further with an id+key+revision index. For any db to perform
optimally the index should fit in RAM. So such a design would
possibly reduce the max size of repository which can be supported
(compared to the older one) for a given amount of memory.

- The read for a specific id can be done in 1 remote call. But that
would involve a select across multiple rows, which might increase the
time taken as it would involve 'm' index lookups and then 'm' reads of
row data for any node having 'n' properties (m > n, assuming multiple
revisions per property are present).
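
For illustration, the normalised layout under discussion would presumably
be something like a row per (id, property, revision) triple (hypothetical
sketch; ids and revisions are made up):

  ID             | PROPNAME        | REV   | VALUE
  2:/content/foo | jcr:primaryType | r1... | "nt:unstructured"
  2:/content/foo | title           | r1... | "Hello"
  2:/content/foo | title           | r2... | "Hello World"

with a composite index on (ID, PROPNAME, REV).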

Maybe we should explore the JSON support being introduced in multiple
dbs: DB2 [2], SQL Server [3], Oracle [4], Postgres [5], MySql [6]. The
problem here is that we would need DB specific implementations and it
also increases the testing effort!

> we can better use the database features, as now the DBE is aware about the 
> document internal structure (it’s not a blob anymore). Eg. we can fetch only 
> a few properties.

In most cases the kind of properties stored in the blob part of the db row
are always read as a whole.

Chetan Mehrotra
[1] https://issues.apache.org/jira/browse/OAK-4471
[2] 
http://www.ibm.com/developerworks/data/library/techarticle/dm-1306nosqlforjson1/
[3] https://msdn.microsoft.com/en-in/library/dn921897.aspx
[4] https://docs.oracle.com/database/121/ADXDB/json.htm
[5] https://www.postgresql.org/docs/9.3/static/functions-json.html
[6] https://dev.mysql.com/doc/refman/5.7/en/json.html


On Wed, Aug 17, 2016 at 7:19 AM, Michael Marth <mma...@adobe.com> wrote:
> Hi Tomek,
>
> I like the idea (agree with Vikas’ comments / cautions as well).
>
> You are hinting at expected performance differences (maybe faster or slower 
> than the current approach). That would probably be worthwhile to investigate 
> in order to assess your idea.
>
> One more (hypothetical at this point) advantage of your approach: we could 
> utilise DB-native indexes as a replacement for property indexes.
>
> Cheers
> Michael
>
>
>
> On 16/08/16 07:42, "Tomek Rekawek" <reka...@adobe.com> wrote:
>
>>Hi Vikas,
>>
>>thanks for the reply.
>>
>>> On 16 Aug 2016, at 14:38, Vikas Saurabh <vikas.saur...@gmail.com> wrote:
>>
>>> * It'd incur a very heavy migration impact on upgrade or RDB setups -
>>> that, most probably, would translate to us having to support both
>>> schemas. I don't feel that it'd easy to flip the switch for existing
>>> setups.
>>
>>That’s true. I think we should take a similar approach here as with the 
>>segment / segment-tar implementations (and we can use oak-upgrade to convert 
>>between them). At least for now.
>>
>>> * DocumentNodeStore implementation very freely touches prop:rev=value
>>> for a given id… […] I think this would get
>>> expensive for index (_id+propName+rev) maintenance.
>>
>>Indeed, probably we’ll have to analyse the indexing capabilities offered by 
>>different database engines more closely, choosing the one that offers good 
>>writing speed.
>>
>>Best regards,
>>Tomek
>>
>>--
>>Tomek Rękawek | Adobe Research | www.adobe.com
>>reka...@adobe.com


Re: svn commit: r1781064 - /jackrabbit/oak/trunk/oak-doc/src/site/markdown/query/index-management.md

2017-01-31 Thread Chetan Mehrotra
Hi Thomas,

On Tue, Jan 31, 2017 at 5:07 PM,  <thom...@apache.org> wrote:
> +The following script creates the index externalId:
> +
> +// create-externalId.txt
> +// create a unique index on externalId
> +{"print": "check if externalId already exists"}
> +{"xpath": "/jcr:root/oak:index/externalId"}
> +{"if": "$resultSize", "=": 1, "print": "index externalId already exists"}
> +{"$create": []}
> +{"if": "$resultSize", "=": 0, "$create": [1]}
> +{"for": "$create", "do": [
> +{"print": "does not exist; creating..."},
> +{"addNode": "/oak:index/externalId", "node": {
> +"jcr:primaryType": "oak:QueryIndexDefinition",
> +"{Name}propertyNames": ["rep:externalId"],
> +"type": "property",
> +"unique": true
> +}},
> +{"session": "save"},
> +{"print": "done; index is now:"},
> +{"xpath": "/jcr:root/oak:index/externalId", "depth": 2}
> +]}
> +exit

Scripting in JSON looks interesting!

However I would like to understand the approach here and how it should
be used. The Oak Console already uses Groovy and can execute such scripts
via the ":load" construct. We already use this with customer setups to
execute scripts hosted on github [1], [2]. So I am not sure why we need
to use this approach and how it meets the requirement for OAK-5324.


Chetan Mehrotra
[1] https://gist.github.com/stillalex/e7067bcb86c89bef66c8
[2] https://gist.github.com/chetanmeh/d7588d96a839dd2d26760913e4055215


Close a CI issue when resolving it as duplicate

2017-02-15 Thread Chetan Mehrotra
It appears that CI issues which are resolved as duplicates but not
closed are still updated upon each successful build. So to reduce the
noise it would be good to also close the issue when resolving a CI
issue as a duplicate.

Chetan Mehrotra


Re: Storing all documents under Root - severe slowness on start-up

2017-02-23 Thread Chetan Mehrotra
Can you provide a thread dump around startup time where you see Oak is
reading all child nodes?
Chetan Mehrotra


On Fri, Feb 24, 2017 at 2:26 AM, Eugene Prystupa
<eugene.pryst...@gmail.com> wrote:
> Thanks, Michael.
>
> I should have included more details in the original email.
> We are on 1.4.10 version of Jackrabbit Oak, we are using Mongo backend.
>
>
> On Thu, Feb 23, 2017 at 3:40 PM, Michael Dürig <mdue...@apache.org> wrote:
>
>>
>>
>> On 23.02.17 19:11, Eugene Prystupa wrote:
>>
>>> We are seeing severe delays on start-up (20 minutes+) when repository is
>>> created (new Jcr(oak).createRepository()).
>>>
>>
>> Regardless of the content structure, 20 min. seems off. What back-end are
>> you on? Which version of Oak is this?
>>
>> Michael
>>
>
>
>
> --
> Thanks,
> Eugene


Re: Merging OAK-5784 into 1.6.1

2017-02-24 Thread Chetan Mehrotra
Changes look fine; however, one aspect might cause an issue:

RestrictionImpl#hashCode -> PropertyValues#hashCode ->
PropertyStateValue#hashCode


private String getInternalString() {
    StringBuilder sb = new StringBuilder();
    Iterator<String> iterator = getValue(Type.STRINGS).iterator();
    while (iterator.hasNext()) {
        sb.append(iterator.next());
        if (iterator.hasNext()) {
            sb.append(",");
        }
    }
    return sb.toString();
}

@Override
public int hashCode() {
    return getType().tag() ^ getInternalString().hashCode();
}


Here it tries to get the value as STRINGS, which leads to
PropertyState#getValue(Type.STRINGS), which would lead to a Binary
getting coerced to a String in Conversions#convert(Blob), which would
lead to loading the whole binary. Now I am not sure if the PropertyState
in RestrictionImpl is applicable for Binary properties also.

Probably PropertyStateValue#hashCode should take care of Binary
properties; that may be why PropertyState#hashCode does not take into
account the value.
Chetan Mehrotra


On Fri, Feb 24, 2017 at 2:34 PM, Angela Schreiber <anch...@adobe.com> wrote:
> hi oak-devs
>
> i would like to merge another improvement into the 1.6.1 branch:
> https://issues.apache.org/jira/browse/OAK-5784
>
> in addition to additional tests i run the AceCreationTest benchmark and
> attached the results to the issue.
> however, having some extra pair of eyes would be appreciated in order to
> limit the risk of regressions.
>
> thanks
> angela
>
>


Re: Supporting "resumable" operations on a large tree

2017-02-24 Thread Chetan Mehrotra
Hi Thomas,

On Fri, Feb 24, 2017 at 1:09 PM, Thomas Mueller <muel...@adobe.com> wrote:
> 9) Sorting of path is needed, so that the repository can be processed bit
> by bit by bit. For that, the following logic is used, recursively: read at
> most 1000 child nodes. If there are more than 1000, then this subtree is
> never split but processed in one step (so many child nodes can still lead
> to large transactions, unfortunately). If less than 1000 child nodes, then
> the names of all child nodes are read, and processed in sorted order
> (sorted by node name).

This should work! So we can implement a "paginated tree traversal" via
the above approach, and a similar approach can be used for Lucene indexes.
It would be good to record this in OAK-2556 (or better, a new issue), and
we can look into implementing it in those parts which do such large
transactions (reindexing async indexes, reindexing sync indexes, content
migration in sidegrade), etc.
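
A rough sketch (untested; processInOneStep is a placeholder) of that
recursive logic:

void process(NodeState node, String path) {
    if (node.getChildNodeCount(1001) > 1000) {
        // too many children - never split, process the subtree in one step
        processInOneStep(node, path);
    } else {
        List<String> names = Lists.newArrayList(node.getChildNodeNames());
        Collections.sort(names); // stable order makes the traversal resumable
        for (String name : names) {
            process(node.getChildNode(name), PathUtils.concat(path, name));
        }
    }
}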

Chetan Mehrotra


Re: Merging OAK-5784 into 1.6.1

2017-02-24 Thread Chetan Mehrotra
On Fri, Feb 24, 2017 at 4:10 PM, Angela Schreiber <anch...@adobe.com> wrote:
> maybe this is
> another indication that we should think about having an implementation
> with plugins.memory and deal with the binary topic there.

+1

Then we can go with the current fix (and also merge it to 1.6) and later
backport the change to the 1.6 branch.


Chetan Mehrotra


Re: CommitEditors looking for specific child node like oak:index, rep:cugPolicy leads to lots of redundant remote calls

2017-02-23 Thread Chetan Mehrotra
I realized now that I recently logged an issue for this, OAK-5511, which
mentions a similar approach. So let's move this discussion there.
Chetan Mehrotra


On Thu, Feb 23, 2017 at 7:06 PM, Thomas Mueller <muel...@adobe.com> wrote:
> Hi,
>
>>I like Marcel's proposal for "enforcing" the use of a mixin on the parent
>>node to indicate that it can have a child node of 'oak:index'. So we can
>>leverage the mixin 'mix:indexable' (OAK-3725) to mark such parent nodes
>>(like root) and IndexUpdate would only look for an 'oak:index' node if
>>the current node has that mixin.
>
> Ah I didn't know about OAK-3725.
>
> I'm a bit worried that we mix different aspects together, not sure which
> is better.
>
> "oak:Indexable" is visible, so it can be added and _removed_ by the user.
> So when trying to remove that mixin, we would need to check there is no
> oak:index child node with nodetype oak:QueryIndexDefinition. We need to
> check the nodetype hierarchy. On the other hand, possibly we can enforce
> that the parent node of oak:index is oak:Indexable (can we?)
>
> I'm not saying with a hidden property ":hasOakIndex"
> (automatically set and removed) it would be painless. For example when
> moving an oak:index node to a new parent, the setting has to be changed at
> both the original and the new parents.
>
> Regards,
> Thomas
>
>


Re: CommitEditors looking for specific child node like oak:index, rep:cugPolicy leads to lots of redundant remote calls

2017-02-23 Thread Chetan Mehrotra
I like Marcel's proposal for "enforcing" the use of a mixin on the parent
node to indicate that it can have a child node of 'oak:index'. So we can
leverage the mixin 'mix:indexable' (OAK-3725) to mark such parent nodes
(like root) and IndexUpdate would only look for an 'oak:index' node if
the current node has that mixin.

This would avoid the extra calls. For new setups we can enforce this,
and for upgrades we can migrate the existing content by using the nodetype
index to update all such "indexable" nodes.
Chetan Mehrotra


On Thu, Feb 23, 2017 at 4:47 PM, Chetan Mehrotra
<chetan.mehro...@gmail.com> wrote:
> On Wed, Feb 22, 2017 at 8:21 PM, Davide Giannella <dav...@apache.org> wrote:
>> Did you mean for ALL the nodes, or only specific nodes?
>>
>> Any way you're suggesting something like the following flow:
>>
>> 1) user call nodebuilder.child(":index")
>> 2) lookup in hidden property
>> 3) if not there, leverage the existing code
>>
>> If so I guess the property has been already fetched and it does not
>> require roundtrips towards the DB. Am I right?
>
> Currently the lookup is being done for ALL nodes. So the IndexUpdate class
> does the following on each changed node:
>
> --
> @Override
> public void enter(NodeState before, NodeState after)
>         throws CommitFailedException {
>     collectIndexEditors(builder.getChildNode(INDEX_DEFINITIONS_NAME),
>             before);
> --
>
> Which translates into checking if the current node has a child node
> 'oak:index', and this leads to redundant calls.
>
> Chetan Mehrotra


Re: CommitEditors looking for specific child node like oak:index, rep:cugPolicy leads to lots of redundant remote calls

2017-02-23 Thread Chetan Mehrotra
On Wed, Feb 22, 2017 at 8:21 PM, Davide Giannella <dav...@apache.org> wrote:
> Did you mean for ALL the nodes, or only specific nodes?
>
> Any way you're suggesting something like the following flow:
>
> 1) user call nodebuilder.child(":index")
> 2) lookup in hidden property
> 3) if not there, leverage the existing code
>
> If so I guess the property has been already fetched and it does not
> require roundtrips towards the DB. Am I right?

Currently the lookup is being done for ALL nodes. So the IndexUpdate class
does the following on each changed node:

--
@Override
public void enter(NodeState before, NodeState after)
        throws CommitFailedException {
    collectIndexEditors(builder.getChildNode(INDEX_DEFINITIONS_NAME),
            before);
--

Which translates into checking if the current node has a child node
'oak:index', and this leads to redundant calls.

Chetan Mehrotra


Re: [DISCUSS] Which I/O statistics should the FileStore expose?

2017-02-13 Thread Chetan Mehrotra
Hi Francesco,

As Julian mentioned, it would be good to collect stats as Metrics.
Have a look at DocumentStoreStats, which collects some stats around
operations being performed by DocumentStore implementations.
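
For example, an IOMonitor implementation could feed timings into a metric
roughly like this (sketch; the stat name and variables are made up). When
backed by MetricStatisticsProvider such metrics get registered in JMX
automatically:

TimerStats readTimer = statisticsProvider.getTimer(
        "SEGMENT_READ_TIME", StatsOptions.METRICS_ONLY);
// on each segment read:
readTimer.update(elapsedNanos, TimeUnit.NANOSECONDS);
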
Chetan Mehrotra


On Tue, Feb 14, 2017 at 12:37 AM, Julian Sedding <jsedd...@gmail.com> wrote:
> Hi Francesco
>
> I believe you should implement an IOMonitor using the metrics in the
> org.apache.jackrabbit.oak.stats package. These can be backed by
> swappable StatisticsProvider implementations. I believe by default
> it's a NOOP implementation. However, I believe that if the
> MetricStatisticsProvider implementation is used, it automatically
> exposes the metrics via JMX. So all you need to do is feed the correct
> data into a suitable metric. I believe Chetan contributed these, so he
> will know more about the details.
>
> Regards
> Julian
>
>
> On Mon, Feb 13, 2017 at 6:21 PM, Francesco Mari
> <mari.france...@gmail.com> wrote:
>> Hi all,
>>
>> The recently introduced IOMonitor allows the FileStore to trigger I/O
>> events. Callback methods from IOMonitor can be implemented to receive
>> information about segment reads and writes.
>>
>> A trivial implementation of IOMonitor is able to track the following raw 
>> data.
>>
>> - The number of segments read and write operations.
>> - The duration in nanoseconds of every read and write.
>> - The number of bytes read or written by each operation.
>>
>> We are about to expose this kind of information from an MBean - for
>> the sake of discussion, let's call it IOMonitorMBean. I'm currently in
>> favour of starting small and exposing the following statistics:
>>
>> - The duration of the latest write (long).
>> - The duration of the latest read (long).
>> - The number of write operations (long).
>> - The number of read operations (long).
>>
>> I would like your opinion about what's the most useful way to present
>> this data through an MBean. Should just raw data be exposed? Is it
>> appropriate for IOMonitorMBean to perform some kind of aggregation,
>> like sum and average? Should richer data be returned from the MBean,
>> like tabular data?
>>
>> Please keep in mind that this data is supposed to be consumed by a
>> monitoring solution, and not a by human reader.


Re: [DISCUSS] Which I/O statistics should the FileStore expose?

2017-02-14 Thread Chetan Mehrotra
On Tue, Feb 14, 2017 at 1:15 PM, Francesco Mari
<mari.france...@gmail.com> wrote:
> What could be gained
> by adding Metrics to the trivial implementation of IOMonitorMBean
> described above?

The metrics created here are automatically registered in JMX (see
MetricStatisticsProvider) and are also accessible over a web UI, for
example when running in Sling [1]. The JMX ones can then be read by an
external monitoring agent.

> In example, how the methods in DocumentStoreStats returning time series as
> CompositeData play with other monitoring solutions like ElasticSearch/Kibana?

That is just for convenience, i.e. for those setups which do not have
any external monitoring installed, the time series provides some
insight into past stats via JMX.

Chetan Mehrotra
[1] 
https://sling.apache.org/documentation/bundles/metrics.html#webconsole-plugin


Collect data for test failure in issue itself

2017-01-19 Thread Chetan Mehrotra
It would be helpful if, while renaming a Hudson-created Jira issue, we
also attach the relevant unit-test.log from the module for which the test
failed and also record the test failure message as a comment.

This simplifies later analysis as CI only retains reports for the past few
builds (not sure of the number). For example, for some of the older issues
the links to CI are now resulting in 404 (see OAK-5263 for example).

Chetan Mehrotra


Re: svn commit: r1779324 - in /jackrabbit/oak/trunk/oak-segment-tar: ./ src/test/java/org/apache/jackrabbit/oak/segment/standby/ src/test/java/org/apache/jackrabbit/oak/segment/test/

2017-01-18 Thread Chetan Mehrotra
Hi Francesco,

On Wed, Jan 18, 2017 at 7:01 PM,  <f...@apache.org> wrote:
> +package org.apache.jackrabbit.oak.segment.test;
> +
> +import java.net.ServerSocket;
> +
> +import org.junit.rules.ExternalResource;
> +
> +public class TemporaryPort extends ExternalResource {
> +
> +private int port;
> +
> +@Override
> +protected void before() throws Throwable {
> +try (ServerSocket socket = new ServerSocket(0)) {
> +port = socket.getLocalPort();
> +}
> +}
> +
> +public int getPort() {
> +return port;
> +}
> +
> +}

This looks useful and can be used in other places too, like in [1].
It would be good if we could move it to oak-commons, in the
org.apache.jackrabbit.oak.commons.junit package.

Chetan Mehrotra
[1] 
https://issues.apache.org/jira/browse/OAK-5441?focusedCommentId=15823491=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15823491


Specify versions for maven plugins used in build for ensuring stable builds (OAK-5455)

2017-01-15 Thread Chetan Mehrotra
Hi Team,

While checking the test failure reports I realized that we do not
specify versions for some important maven plugins, which can make build
behaviour environment dependent.

Should this be something we change at this stage of the release, i.e.
specifying

1. At least the currently used versions for the plugins
2. Updating the versions to the latest where possible

Opened OAK-5455 to track this.

Chetan Mehrotra


RepositorySidegrade and commit hooks

2016-08-18 Thread Chetan Mehrotra
Hi,

Does RepositorySidegrade run all the commit hooks required for
getting a consistent JCR level state, like the permission editor, property
editor etc.?

I can see such hooks configured for RepositoryUpgrade but am not seeing
any such hook configured for RepositorySidegrade.

Probably we should also configure the same set of hooks?

Chetan Mehrotra


Re: Help with unit tests for JMX stats for S3DataStore

2016-08-18 Thread Chetan Mehrotra
Hi Matt,

It would be easier if you can open an issue and provide your patch
there so that one can have a better understanding of what needs to be
tested.

In general you can use MemoryDocumentStore (the default used by the
DocumentMK builder) and then possibly use Sling OSGi mocks to pick up
the registered MBean services. For an example have a look at
SegmentNodeStoreServiceTest, which uses OSGi mocks to activate the
service and then picks up the registered services to do the assertions.
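
A rough sketch of that OSGi mocks pattern (untested; createS3DataStore()
and S3DataStoreStatsMBean are placeholders for whatever your patch
introduces):

public class S3DataStoreStatsTest {
    @Rule
    public final OsgiContext context = new OsgiContext();

    @Test
    public void mbeanGetsRegistered() {
        // register the dependencies the component requires ...
        context.registerService(BlobStore.class,
                new DataStoreBlobStore(createS3DataStore()));
        // ... then activate it and look up the MBean from the mock registry
        context.registerInjectActivateService(new SegmentNodeStoreService(),
                Collections.<String, Object>singletonMap("customBlobStore", true));
        assertNotNull(context.getService(S3DataStoreStatsMBean.class));
    }
}
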
Chetan Mehrotra


On Fri, Aug 19, 2016 at 6:14 AM, Matt Ryan <o...@mvryan.org> wrote:
> Hi,
>
> I’m working on a patch for Oak that would add some JMX stats for
> S3DataStore.  I’m adding code to register a new Mbean in
> DocumentNodeStoreService (also SegmentNodeStoreService, but let’s just
> worry about the first one for now).
>
> I wanted to create some unit tests to verify that my new JMX stats are
> available via JMX.  The idea I had would be that I would simply instantiate
> a DocumentNodeStoreService, create an S3DataStore, wrap it in a
> DataStoreBlobStore, and bind that in the DocumentNodeStoreService.  Then
> with a JMX connection I could check that my Mbean had been registered,
> which it should have been by this time.
>
>
> This was all going relatively fine until I hit a roadblock in
> DocumentNodeStoreService::registerNodeStore().  The DocumentMKBuilder uses
> a DocumentNodeStore object that I need to mock in order to do the test, and
> I cannot mock DocumentNodeStore because it is a final class.  I tried
> working around that, but ended up hitting another road block in the
> DocumentNodeStore constructor where I then needed to mock a NodeDocument -
> again, can’t mock it because it is a final class.
>
>
> I realize it is theoretically possible to mock final classes using
> PowerMock, although by this point I am starting to wonder if all this
> effort is a good way to use my time or if I should just test my code
> manually.
>
>
> Is it important that DocumentNodeStore be a final class?  If not, how would
> we feel about me simply making the class non-final?  If so, what
> suggestions do you have to help me unit test this thing?  I feel that it
> should be easier to unit test new code than this, so maybe I’m missing
> something.
>
>
> Thanks
>
>
> -Matt Ryan


Re: RepositorySidegrade and commit hooks

2016-08-19 Thread Chetan Mehrotra
Thanks Tomek for the confirmation. Opened OAK-4684 to track that.
Chetan Mehrotra


On Fri, Aug 19, 2016 at 3:52 PM, Tomek Rekawek <reka...@adobe.com> wrote:
> Hi Chetan,
>
> yes, it seems that this has been overlooked in the OAK-3239 (porting the 
> —include-paths support from RepositoryUpgrade). Feel free to create an issue 
> / commit a patch or let me know if you want me to do it.
>
> Best regards,
> Tomek
>
> --
> Tomek Rękawek | Adobe Research | www.adobe.com
> reka...@adobe.com
>
>> On 19 Aug 2016, at 10:38, Chetan Mehrotra <chetan.mehro...@gmail.com> wrote:
>>
>> For a complete migration, yes, all the bits are there. However people also
>> use this for partial incremental migration from a source system to a target
>> system. In that case include paths are provided for those paths whose
>> content needs to be updated. In such a case it can happen that derived
>> content for those paths (property index, permission store entries) does
>> not get updated and that would result in an inconsistent state.
>> Chetan Mehrotra
>>
>>
>> On Fri, Aug 19, 2016 at 1:59 PM, Alex Parvulescu
>> <alex.parvule...@gmail.com> wrote:
>>> Hi,
>>>
>>> I don't think any extra hooks are needed here. Sidegrade is just a change
>>> in persistence format, all the bits should be there already in the old
>>> repository.
>>>
>>> best,
>>> alex
>>>
>>> On Fri, Aug 19, 2016 at 6:45 AM, Chetan Mehrotra <chetan.mehro...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Does RepositorySidegrade run all the commit hooks required for
>>>> getting a consistent JCR level state, like the permission editor, property
>>>> editor etc.?
>>>>
>>>> I can see such hooks configured for RepositoryUpgrade but am not seeing
>>>> any such hook configured for RepositorySidegrade.
>>>>
>>>> Probably we should also configure the same set of hooks?
>>>>
>>>> Chetan Mehrotra
>>>>
>


Re: Using same index definition for both async and sync indexing

2016-08-03 Thread Chetan Mehrotra
On Wed, Aug 3, 2016 at 2:23 PM, Alex Parvulescu
<alex.parvule...@gmail.com> wrote:
> extend the current index definition
> for the 'async' property and allow multiple values.

That should work and looks like a natural extension of the flag. Just
that having an empty value in the array does not look good (might confuse
people in the UI). So we can have a marker value to indicate it.
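
For example (illustrative values): async = ["async", "sync"], where "sync"
acts as the marker indicating the index should also be updated
synchronously.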

>What about overloading the 'IndexUpdateCallback' with a 'isSync()' method
> coming from the 'IndexUpdate' component. This will reduce the change
> footprint and only components that need to know this information will use
> it.

That can be done. Going forward we also need to pass in CommitInfo or
something like that (see other mail).

Another option can be to have a new interface for IndexEditorProvider
(along the same lines as AdvancedQueryIndex > QueryIndex). So an editor
implementing the new interface would have the extra params passed in. And
there we introduce something like IndexingContext, which folds in
IndexUpdateCallback, indexing mode, index path, CommitInfo etc.

Chetan Mehrotra


Re: [observation] pure internal or external listeners

2016-09-02 Thread Chetan Mehrotra
On Fri, Sep 2, 2016 at 4:00 PM, Stefan Egli <stefane...@apache.org> wrote:
> If we
> separate listeners into purely internal vs external, then a queue as a whole
> is either purely internal or external and we no longer have this issue.

Not sure here how this would work. The observation queue is made up
of ContentChange instances, each a tuple of [root NodeState, CommitInfo
(null for external)].

--- NS1-L---NS2-L--NS3---NS4-L---NS5-L ---NS6-L

--- a  /a/b  - /a/c --- /a/c
 /a/b /a/b
/a/d

So if we dedicate a queue for local changes only, what would happen?

If we drop NS3 then while diffing [NS2-L, NS4-L] /a/c would be
reported as "added" and "local". Now say we have a listener which listens
for locally added nt:file nodes so that it can start some processing job
for them. Such a listener would then think it's a locally added node and
would start a duplicate job.

In general I believe

Listener for external Change
--
Listeners which are listening for external changes maintain some
state and purge/refresh it upon detecting changes in the paths of
interest. They would work fine if multiple content change occurrences
are merged:

[NS4-L, NS5-L] + [NS5-L,NS6-L] = [NS4, NS6] (external) as they would
still detect the change

An example of this is LuceneIndexObserver, which sets the queue size to 5
and does not care whether a change is local or not. It is just interested
in whether the index node is updated.

Listener for local Change
--

Such a listener is more particular about the type of change and is doing
some persisted state change, e.g. registering a job or invoking some
third party service to update a value. This listener is only interested
in local changes as it knows the same listener is also active on other
cluster nodes (homogeneous cluster setup), so if a node gets added it
only needs to react on the cluster node where it got added.

So for such listeners it needs to be ensured that mixed content changes
are not compacted. So it's fine to have

[NS4-L, NS5-L] + [NS5-L,NS6-L] = [NS4, NS6] (can be treated as
local with loss of user identity which caused the change)
[NS2-L, NS3]+ [NS3, NS4-L] = [NS2-L, NS4-L] (cannot be treated as local)

Just thinking out loud here to understand the problem space better :)

Chetan Mehrotra


Re: CommitHooks as OSGi Components.

2016-09-12 Thread Chetan Mehrotra
On Mon, Sep 12, 2016 at 3:12 PM, Ian Boston <i...@tfd.co.uk> wrote:
> but if the information that connect a sessionID/userID to the
> paths that are modified is available through some other route, I might be
> able to use something else.

A regular Observer should work for that case. Just register an
instance with the service registry and it would be picked up; for non
external events the CommitInfo would be present.
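
A minimal sketch (component and class names are hypothetical):

@Component(service = Observer.class)
public class PathAuditObserver implements Observer {
    @Override
    public void contentChanged(NodeState root, CommitInfo info) {
        if (info != null) { // null indicates an external change
            String sessionId = info.getSessionId();
            String userId = info.getUserId();
            // diff against the previously seen root NodeState to
            // connect this session/user to the modified paths
        }
    }
}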

Chetan Mehrotra


Re: CommitHooks as OSGi Components.

2016-09-12 Thread Chetan Mehrotra
On Mon, Sep 12, 2016 at 2:08 PM, Ian Boston <i...@tfd.co.uk> wrote:
> Unfortunately the IndexProvider route doesn't appear give me the
> information I am after (CommitInfo).

Any details around intended usage? CommitInfo is now exposed via
OAK-4642 to IndexEditorProvider

Chetan Mehrotra


Re: Minimum JDK version

2016-09-12 Thread Chetan Mehrotra
I think Marcel created OAK-4791 for the same. So that should take care
of enforcing this constraint.
Chetan Mehrotra


On Mon, Sep 12, 2016 at 4:40 PM, Stefan Seifert <sseif...@pro-vision.de> wrote:
> in sling we use the animal sniffer plugin for exactly this purpose [1].
> it checks that the compiled codes only uses signatures available in the 
> configured jdk.
>
> stefan
>
> [1] http://www.mojohaus.org/animal-sniffer/animal-sniffer-maven-plugin/
>
>>-Original Message-
>>From: Tomek Rekawek [mailto:reka...@adobe.com]
>>Sent: Monday, September 12, 2016 1:06 PM
>>To: oak-dev@jackrabbit.apache.org
>>Subject: Re: Minimum JDK version
>>
>>Hi,
>>
>>the interesting thing here is that we actually compile the code with -
>>source and -target=1.6 in these branches [1][2]. However, the javac still
>>uses the rt.jar coming from the current JDK and it does contain the
>>java.nio package. It seems that the only way to check the API usage
>>correctness is to switch to JDK 1.6.
>>
>>Or maybe there’s some way to validate whether the used packages matches
>>selected JDK version (eg. via some plugin reading the @since javadocs in
>>API classes)?
>>
>>Regards,
>>Tomek
>>
>>[1] https://github.com/apache/jackrabbit-oak/blob/1.4/oak-
>>parent/pom.xml#L97
>>[2] https://github.com/apache/jackrabbit-oak/blob/1.2/oak-
>>parent/pom.xml#L95
>>
>>
>>--
>>Tomek Rękawek | Adobe Research | www.adobe.com
>>reka...@adobe.com
>>
>>> On 12 Sep 2016, at 11:42, Davide Giannella <dav...@apache.org> wrote:
>>>
>>> Hello team,
>>>
>>> following the recent mishap about JDK version and releases highlighted
>>> two main issues:
>>>
>>> cannot find jenkins for anything that is not 1.6
>>>
>>> we should enforce the build to build with the minimum required JDK.
>>>
>>> Now for the second point, this is easily achievable. What we have to
>>> decide is whether we want this enforcement done on all the builds, or
>>> only during releases build and checks.
>>>
>>> I'm for having it enforced on all the builds.
>>>
>>> Thoughts?
>>>
>>> Davide
>>>
>>>
>


Re: Infinite loop

2016-09-15 Thread Chetan Mehrotra
Looks like the index would need to be reindexed. It would be better to
contact Adobe Support as a closer analysis would be required.
Chetan Mehrotra


On Thu, Sep 15, 2016 at 6:32 PM, Thiago Sanches <tsi...@gmail.com> wrote:
> I removed the index folder but the error persists. I tried to remove the
> "/crx/packmgr/service.jsp/file" node (that was causing the error before)
> But still failing...
>
> 15.09.2016 12:58:31.417 *DEBUG* [pool-7-thread-3]
> org.apache.jackrabbit.oak.plugins.index.AsyncIndexUpdate [async] The index
> update is still failing
> org.apache.jackrabbit.oak.api.CommitFailedException: OakLucene0005: Failed
> to remove the index entries of the removed subtree
> /crx/packmgr/service.jsp/file
>
> I know that we removed and maybe caused some other issue, but I don't know
> why the cause of the first error (before the node deletion).
>
> On Thu, Sep 15, 2016 at 8:51 AM, Thiago Sanches <tsi...@gmail.com> wrote:
>
>> Hello Chetan, good morning.
>>
>> Yes, there is a lot of ".bak" inside segmentstore folder with the creation
>> time close to the "force" restart.
>> I'll try to remove the index folder.
>>
>> Thanks for your help.
>>
>> On Thu, Sep 15, 2016 at 1:44 AM, Chetan Mehrotra <
>> chetan.mehro...@gmail.com> wrote:
>>
>>> On Thu, Sep 15, 2016 at 2:17 AM, Thiago Sanches <tsi...@gmail.com> wrote:
>>> > This issue start to appers after some problemas with disk space and some
>>> > force restarts on AEM.
>>>
>>> Do you see the presence of ".bak" files in the segmentstore folder post
>>> system restart after an unclean shutdown, with creation times close to the
>>> subsequent restart times? Can you try cleaning the local index folder
>>> (repository/index) and restarting to see if it's resolved? If not, I would
>>> suggest following up on the Adobe Support portal.
>>>
>>>
>>> Chetan Mehrotra
>>>
>>
>>


Re: Faster reference binary handling

2016-09-16 Thread Chetan Mehrotra
I think fixes have recently been done in this area. However it
would be good to have an integration test for the reference check scenario
to ensure that it does not unnecessarily download the blobs.


On Fri, Sep 16, 2016 at 11:56 AM, Thomas Mueller <muel...@adobe.com> wrote:
> Hi,
>
> Possibly the binary is downloaded from S3 in this case. We have seen
> similar performance issues with datastore GC when using the S3 datastore.
>
> It should be possible to verify this with full thread dumps. Plus we would
> see where exactly the download occurs. Maybe it is checking the length or
> so.
>
>> this API requires Oak to always retrieve the binary value from the DS
>
> I think the problem is in the S3 datastore implementation, and not the
> API. But lets see.
>
> Regards,
> Thomas
>
>
> On 15/09/16 18:04, "Tommaso Teofili" <tommaso.teof...@gmail.com> wrote:
>
>>Hi all,
>>
>>while working with Oak S3 DS I have witnessed slowness (no numbers, just
>>'slow' from a user perspective) in persisting a binary using its
>>reference;
>>although this may be related to some environment specific issue I wondered
>>about the reference binary handling we introduced in JCR-3534 [1].
>>In fact the implementation there requires to do something like
>>
>>ReferenceBinary ref = new SimpleReferenceBinary(referenceString);
>>Binary referencedBinary =
>>session.getValueFactory().createValue(ref).getBinary();
>>node.setProperty("foo", referencedBinary);
>>
>>on the "installation" side.
>>Despite all possible issues in the implementation it seems this API
>>requires Oak to always retrieve the binary value from the DS and then
>>store
>>its value into the node whereas it'd be much better to avoid having to
>>read
>>the value but instead bind it to that referenced binary.
>>
>>ReferenceBinary ref = new SimpleReferenceBinary(referenceString);
>>if (ref.isValid()) { // referenced binary exists in the DS
>>  node.setProperty("foo", ref, Type.BINARY); // set a string with binary
>>type !?
>>}
>>
>>I am not sure if the above code could make sense, probably not, but at
>>least wanted to point out the problem as to seek for possible
>>enhancements.
>>
>>Regards,
>>Tommaso
>>
>>[1] : https://issues.apache.org/jira/browse/JCR-3534
>


Re: Possibility of making nt:resource unreferenceable

2016-10-07 Thread Chetan Mehrotra
On Fri, Oct 7, 2016 at 11:34 AM, Carsten Ziegeler <cziege...@apache.org> wrote:
> Whenever a nt:resource child node of a nt:file node is created, it is
> silently changed to oak:resource.

I like this!

This can be done via an Editor which does this transformation upon
addition of a new node; something which can be easily enabled/disabled
if the need arises. With this we would not have to make changes in many
places like JcrUtils.putFile, WebDav, Vault, the Sling Post Servlet, or
any custom code creating nt:file, say using JcrUtils.putFile.
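
A rough sketch (untested) of such an editor:

public class OakResourceTypeEditor extends DefaultEditor {
    private final NodeBuilder builder;

    public OakResourceTypeEditor(NodeBuilder builder) {
        this.builder = builder;
    }

    @Override
    public Editor childNodeAdded(String name, NodeState after) {
        NodeBuilder child = builder.getChildNode(name);
        if ("jcr:content".equals(name)
                && "nt:resource".equals(after.getName("jcr:primaryType"))
                && "nt:file".equals(builder.getName("jcr:primaryType"))) {
            // silently retype the newly added child node
            child.setProperty("jcr:primaryType", "oak:Resource", Type.NAME);
        }
        return new OakResourceTypeEditor(child); // descend into the subtree
    }
}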

Chetan Mehrotra


Re: CommitHooks as OSGi Components.

2016-09-14 Thread Chetan Mehrotra
On Thu, Sep 15, 2016 at 1:15 AM, Rob Ryan <rr...@adobe.com> wrote:
> Last I heard even local events can be subject to loss of the user id if so 
> many events are being processed that ‘compaction’ is used to mitigate the 
> load. Is this still the case?
>
> Please don’t point people toward the availability of the user id from events 
> (without full disclaimers) if it will not *always* be available.

Thats the case for JCR level ObservationListener which makes use of
BackgroundObserver. In Ian case he is directly building on top of
Observer and hence can control the compaction aspect.


Chetan Mehrotra


Re: IndexEditorProvider behaviour question.

2016-09-14 Thread Chetan Mehrotra
Note that so far LuceneIndexEditor was used only for the async indexing
case and hence invoked only on the leader node every 5 sec. So performance
aspects here were not that critical. However with the recent work on
Hybrid indexes it would be used in the critical path and hence such
aspects are important.

On Wed, Sep 14, 2016 at 3:10 PM, Ian Boston <i...@tfd.co.uk> wrote:
> A and B mean that the work of creating the tree and working out the changes
> in a tree will be duplicated roughly n times, where n is the number of
> index definitions.

Here note that the diff would be performed only once at any level and
IndexUpdate would then pass the changes to the various editors. However the
construction of trees can be avoided, and I have opened OAK-4806 for
that now. The Oak issue also has details around why Tree was used.

Also, with multiple index editors performance does decrease. See
OAK-1273. If we switch to the Hybrid Index then this aspect improves a
bit as instead of having 50 different property indexes (with 50 editor
instances for each commit) we can have a single editor with 50 property
definitions. This can be seen in the benchmark in Hybrid Index (OAK-4412)
by changing numOfIndexes.

If you see any other area of improvement, say around unnecessary object
generation, then let us know!

Chetan Mehrotra


Re: Infinite loop

2016-09-14 Thread Chetan Mehrotra
On Thu, Sep 15, 2016 at 2:17 AM, Thiago Sanches <tsi...@gmail.com> wrote:
> This issue start to appers after some problemas with disk space and some
> force restarts on AEM.

Do you see the presence of ".bak" files in the segmentstore folder post
system restart after an unclean shutdown, with creation times close to the
subsequent restart times? Can you try cleaning the local index folder
(repository/index) and restarting to see if it's resolved? If not, I would
suggest following up on the Adobe Support portal.


Chetan Mehrotra


Re: [VOTE] Require JDK7 for Oak 1.4

2016-09-19 Thread Chetan Mehrotra
+1
Chetan Mehrotra


On Mon, Sep 19, 2016 at 12:41 PM, Marcel Reutegger <mreut...@adobe.com> wrote:
> +1
>
> Regards
>  Marcel
>
>
> On 16/09/16 17:16, Julian Reschke wrote:
>>
>> On 2016-09-16 17:11, Davide Giannella wrote:
>>>
>>> ...
>>
>>
>> OK then.
>>
>>   [ ] +1 Yes, require JDK7 for Oak 1.4
>>   [ ] -1 No, continue to support JDK6
>>
>> This majority vote is open for at least 72 hours.
>>
>> Best regards, Julian
>>
>>
>


Re: Stopping long running traversal queries

2016-09-20 Thread Chetan Mehrotra
You can specify a traversal limit via QueryEngineSettingsMBean. This
would be applicable to any running query. A rough sketch is below.
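
For example, something along these lines over JMX; the ObjectName
pattern below is an assumption and may differ per deployment, but
LimitReads and FailTraversal are attributes of QueryEngineSettingsMBean:

-----
import java.lang.management.ManagementFactory;

import javax.management.Attribute;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class TraversalLimitSetter {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // assumed ObjectName pattern; adjust to match your deployment
        ObjectName pattern = new ObjectName(
                "org.apache.jackrabbit.oak:type=QueryEngineSettings,*");
        for (ObjectName on : server.queryNames(pattern, null)) {
            // cap the number of nodes a query may read while traversing
            server.setAttribute(on, new Attribute("LimitReads", 100000L));
            // optionally make traversing queries fail outright
            server.setAttribute(on, new Attribute("FailTraversal", true));
        }
    }
}
-----
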
Chetan Mehrotra


On Wed, Sep 21, 2016 at 6:26 AM, Pantula Rajesh <praj...@adobe.com> wrote:
> Hi All,
>
> Is there a way to stop long running traversal queries? I was looking if there 
> is any JMX bean which can stop such queries.
>
> Regards,
> Rajesh


Possibility of making nt:resource unreferenceable

2016-10-04 Thread Chetan Mehrotra

Hi Team,

Some time back we discussed the requirement for oak:Resource as a
non-referenceable replacement for nt:resource (OAK-4567). This topic was
also discussed on the DL [1], and at that time it was decided that
changing the defaults (making nt:resource non-referenceable) is not
possible and hence applications should switch to other nodetypes while
creating nt:file instances.

Towards that end I started a discussion on the Sling side as part of
SLING-6090. See [2] for the discussion thread. However, the team there
is of the view that this would require changes in many places, and wants
us to think again about changing the defaults.

So the question here is

===
Can we change the defaults for the nt:resource nodetype to be
non-referenceable? This has also been proposed for JCR 2.0 [3]. JR2 and
Oak though still use the nodetype definition from JCR 1.0.
===

To reiterate, I am just aiming for a solution here which enables a user
to use a more optimal nodetype and get the best performance out of the
underlying repository.

Hopefully we can converge on some agreement here :)

Chetan Mehrotra
[1] http://markmail.org/thread/uj2ht4jwdrck7eja
[2] http://markmail.org/thread/77xvjxtx42euhss4
[3] https://java.net/jira/browse/JSR_283-428


Re: Oak 1.5.13 release plan

2016-10-24 Thread Chetan Mehrotra
I would like to have OAK-4975 included. Marked that issue as blocker.
I hope to resolve that today itself
Chetan Mehrotra


On Thu, Oct 20, 2016 at 7:07 PM, Davide Giannella <dav...@apache.org> wrote:
> Hello team,
>
> I'm planning to cut Oak 1.5.13 on Monday 24th.
>
> If there are any objections please let me know. Otherwise I will
> re-schedule any non-resolved issue for the next iteration.
>
> Thanks
> Davide
>
>


Issues waiting for changes in DocumentStore API

2016-10-25 Thread Chetan Mehrotra
We currently have a few open issues which depend on updating the
DocumentStore API:

OAK-3878 - Avoid caching of NodeDocument while iterating in
BlobReferenceIterator
OAK-3001 - Simplify JournalGarbageCollector using a dedicated timestamp property

It would be good if we can decide now what the API should be, such that
these issues can be addressed in the 1.6 release.

Maybe we go for a usecase-specific API? A made-up sketch of what that
could look like is below.
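
For illustration only (an invented shape, not an actual proposal or
existing Oak API), a usecase-specific method could look like:

-----
import java.util.List;

import org.apache.jackrabbit.oak.plugins.document.Collection;
import org.apache.jackrabbit.oak.plugins.document.Document;

// Hypothetical addition; name and signature invented for this mail
public interface CacheBypassingDocumentStore {

    // Variant of DocumentStore.query(...) which does not populate the
    // document cache, useful for one-off scans such as the blob
    // reference iteration in OAK-3878
    <T extends Document> List<T> queryUncached(Collection<T> collection,
            String fromKey, String toKey, int limit);
}
-----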

Chetan Mehrotra


Re: Oak 1.5.13 release plan

2016-10-24 Thread Chetan Mehrotra
On Mon, Oct 24, 2016 at 4:53 PM, Julian Reschke <julian.resc...@gmx.de> wrote:
> Chetan: I see that you marked OAK-3036 as "blocker" for this release -- but
> then, do we have a plan to resolve it in a timely manner?

Missed that. Moved it to the next release and will try to get it
resolved by that time!

Chetan Mehrotra


[REVIEW] Configuration required for node bundling config for DocumentNodeStore - OAK-1312

2016-10-21 Thread Chetan Mehrotra
Hi Team,

Work for OAK-1312 is now in trunk. To enable this feature the user has to
provision some config as content in the repository. The config needs to be
created under '/jcr:system/rep:documentStore/bundlor' [1].

Example
-
jcr:system
  rep:documentStore
bundlor
  app:Asset{pattern = [jcr:content/metadata, jcr:content/renditions,
  jcr:content/renditions/**, jcr:content]}
  nt:file{pattern = [jcr:content]}
-

Key points


* This config is only required when the system is using DocumentNodeStore
* Any change here would be picked up via Observation
* Config is supposed to be changed only by a system admin, so it needs to
be secured (OAK-4959)
* Config can be changed anytime and would impact only newly created nodes.

Open Questions


Bootstrap default config
---

Should we ship with a default config for nt:file (maybe others like
rep:AccessControllable)? If yes, then how do we do that? One way can be to
introduce a new 'WhiteboardRepositoryInitializer', and then
DocumentNodeStore can register one which bootstraps a default config. A
rough sketch is below.
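
For illustration, a minimal sketch using the existing
RepositoryInitializer callback; the class name and the exact property
shape are assumptions:

-----
import static java.util.Arrays.asList;

import org.apache.jackrabbit.oak.api.Type;
import org.apache.jackrabbit.oak.spi.lifecycle.RepositoryInitializer;
import org.apache.jackrabbit.oak.spi.state.NodeBuilder;

// Hypothetical initializer which seeds a default bundling config
class BundlingConfigInitializer implements RepositoryInitializer {
    @Override
    public void initialize(NodeBuilder builder) {
        NodeBuilder bundlor = builder.child("jcr:system")
                .child("rep:documentStore")
                .child("bundlor");
        // only seed the default if it has not been configured already
        if (!bundlor.hasChildNode("nt:file")) {
            bundlor.child("nt:file")
                    .setProperty("pattern", asList("jcr:content"), Type.STRINGS);
        }
    }
}
-----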

Chetan Mehrotra
[1] 
https://issues.apache.org/jira/browse/OAK-1312?focusedCommentId=15387241=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15387241


Re: svn commit: r1765583 - in /jackrabbit/oak/trunk: oak-core/src/main/java/org/apache/jackrabbit/oak/api/jmx/ oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/property/strategy/ oak-cor

2016-10-21 Thread Chetan Mehrotra
On Thu, Oct 20, 2016 at 6:08 PM, Julian Sedding <jsedd...@gmail.com> wrote:
> I think we could get away with increasing this to 4.1.0 if we can
> annotate QueryEngineSettingsMBean with @ProviderType.

Makes sense. Opened OAK-4977 for that

Chetan Mehrotra


Build failing due to compilation errors in oak-segment-tar

2016-11-22 Thread Chetan Mehrotra
Build is failing locally and in CI [1] due to a compilation error in
oak-segment-tar. Looks like the SegmentGCStatus class is not checked in.

[ERROR] 
/home/chetanm/git/apache/jackrabbit-oak/oak-segment-tar/src/main/java/org/apache/jackrabbit/oak/segment/file/FileStoreGCMonitor.java:[32,51]
error: cannot find symbol
[ERROR] symbol:   class SegmentGCStatus
[ERROR] location: package org.apache.jackrabbit.oak.segment.compaction


Chetan Mehrotra
[1] 
https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/1296/jdk=JDK%201.8%20(latest),nsfixtures=SEGMENT_MK,profile=unittesting/console


Re: Build failing due to compilation errors in oak-segment-tar

2016-11-22 Thread Chetan Mehrotra
Added the missing file in r1770910.

@Francesco/Andrei: Can you check if it's the intended file? With this,
compilation passes on my setup.
Chetan Mehrotra


On Wed, Nov 23, 2016 at 10:42 AM, Chetan Mehrotra
<chetan.mehro...@gmail.com> wrote:
> Build is failing locally and in CI [1] due to a compilation error in
> oak-segment-tar. Looks like the SegmentGCStatus class is not checked in.
>
> [ERROR] 
> /home/chetanm/git/apache/jackrabbit-oak/oak-segment-tar/src/main/java/org/apache/jackrabbit/oak/segment/file/FileStoreGCMonitor.java:[32,51]
> error: cannot find symbol
> [ERROR] symbol:   class SegmentGCStatus
> [ERROR] location: package org.apache.jackrabbit.oak.segment.compaction
>
>
> Chetan Mehrotra
> [1] 
> https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/1296/jdk=JDK%201.8%20(latest),nsfixtures=SEGMENT_MK,profile=unittesting/console


Re: oak-lucene shaded

2016-11-24 Thread Chetan Mehrotra
Hi Torgeir,

We would not be able to shade the Lucene classes, as they are exported
and meant to be used by certain SPI implementations. So as of now there
is no out-of-the-box solution for using a different Lucene version in
the non-OSGi world.


Chetan Mehrotra


On Wed, Nov 23, 2016 at 7:15 PM, Torgeir Veimo <torgeir.ve...@gmail.com> wrote:
> Second version: this pom file can be put in a separate directory as a
> self-contained maven artifact and includes oak-lucene remotely.
>
> <?xml version="1.0" encoding="UTF-8"?>
>
> <project xmlns="http://maven.apache.org/POM/4.0.0"
>          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>          xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
>     <modelVersion>4.0.0</modelVersion>
>
>     <groupId>no.karriere</groupId>
>     <version>0.1-SNAPSHOT</version>
>     <artifactId>oak-lucene-shaded</artifactId>
>     <name>Oak Lucene (shaded)</name>
>     <description>Oak Lucene integration subproject</description>
>
>     <properties>
>         <oak.version>1.5</oak.version>
>         <lucene.version>4.7.1</lucene.version>
>         <tika.version>1.4.6</tika.version>
>     </properties>
>
>     <build>
>         <plugins>
>             <plugin>
>                 <groupId>org.apache.maven.plugins</groupId>
>                 <artifactId>maven-source-plugin</artifactId>
>                 <version>3.0.1</version>
>                 <executions>
>                     <execution>
>                         <id>generate-sources-for-shade-plugin</id>
>                         <phase>package</phase>
>                         <goals>
>                             <goal>jar-no-fork</goal>
>                         </goals>
>                     </execution>
>                 </executions>
>             </plugin>
>             <plugin>
>                 <groupId>org.apache.maven.plugins</groupId>
>                 <artifactId>maven-shade-plugin</artifactId>
>                 <version>3.0.0-SNAPSHOT</version>
>                 <executions>
>                     <execution>
>                         <phase>package</phase>
>                         <goals>
>                             <goal>shade</goal>
>                         </goals>
>                         <configuration>
>                             <createDependencyReducedPom>false</createDependencyReducedPom>
>                             <createSourcesJar>true</createSourcesJar>
>                             <shadeSourcesContent>true</shadeSourcesContent>
>                             <transformers>
>                                 <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
>                             </transformers>
>                             <relocations>
>                                 <relocation>
>                                     <pattern>org.apache.lucene</pattern>
>                                     <shadedPattern>org.shaded.apache.lucene</shadedPattern>
>                                 </relocation>
>                                 <relocation>
>                                     <pattern>org.tartarus.snowball</pattern>
>                                     <shadedPattern>org.shaded.tartarus.snowball</shadedPattern>
>                                 </relocation>
>                             </relocations>
>                             <artifactSet>
>                                 <excludes>
>                                     <exclude>org.apache.jackrabbit:oak-core</exclude>
>                                     <exclude>org.apache.jackrabbit:oak-commons</exclude>
>                                     <exclude>org.apache.jackrabbit:oak-blob</exclude>
>                                     <exclude>com.google.guava:guava</exclude>
>                                     <exclude>commons-codec:commons-codec</exclude>
>                                     <exclude>commons-io:commons-io</exclude>
>                                     <exclude>javax.jcr:jcr</exclude>
>                                     <exclude>org.apache.jackrabbit:jackrabbit-api</exclude>
>                                     <exclude>org.apache.jackrabbit:jackrabbit-jcr-commons</exclude>
>                                     <exclude>org.apache.tika:tika-core</exclude>
>                                     <exclude>org.slf4j:slf4j-api</exclude>
>                                 </excludes>
>                             </artifactSet>
>                         </configuration>
>                     </execution>
>                 </executions>
>             </plugin>
>         </plugins>
>     </build>
>
>     <dependencies>
>         <dependency>
>             <groupId>org.apache.jackrabbit</groupId>
>             <artifactId>oak-core</artifactId>
>             <version>${oak.version}</version>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.jackrabbit</groupId>
>             <artifactId>oak-lucene</artifactId>
>             <version>${oak.version}</version>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.tika</groupId>
>             <artifactId>tika-core</artifactId>
>             <version>${tika.version}</version>
>         </dependency>
>
>         <dependency>
>             <groupId>org.apache.lucene</groupId>
>             <artifactId>lucene-core</artifactId>
>             <version>${lucene.version}</version>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.lucene</groupId>
>             <artifactId>lucene-analyzers-common</artifactId>
>             <version>${lucene.version}</version>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.lucene</groupId>
>             <artifactId>lucene-queryparser</artifactId>
>             <version>${lucene.version}</version>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.lucene</groupId>
>             <artifactId>lucene-queries</artifactId>
>             <version>${lucene.version}</version>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.lucene</groupId>
>             <artifactId>lucene-suggest</artifactId>
>             <version>${lucene.version}</version>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.lucene</groupId>
>             <artifactId>lucene-highlighter</artifactId>
>             <version>${lucene.version}</version>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.lucene</groupId>
>             <artifactId>lucene-memory</artifactId>
>             <version>${lucene.version}</version>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.lucene</groupId>
>             <artifactId>lucene-misc</artifactId>
>             <version>${lucene.version}</version>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.lucene</groupId>
>             <artifactId>lucene-facet</artifactId>
>             <version>${lucene.version}</version>
>         </dependency>
>
>         <dependency>
>             <groupId>org.apache.tika</groupId>
>             <artifactId>tika-parsers</artifactId>

Re: Frequent failures in standby test

2016-11-24 Thread Chetan Mehrotra
Per https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/1298/
the tests failed again, but mostly on JDK 1.7. The tests on JDK 1.8
appear to have passed.
Chetan Mehrotra


On Tue, Nov 22, 2016 at 12:48 PM, Chetan Mehrotra
<chetan.mehro...@gmail.com> wrote:
> They are from oak-segment-tar. See
> https://builds.apache.org/job/Apache%20Jackrabbit%20Oak%20matrix/1295/#showFailuresLink
> Chetan Mehrotra
>
>
> On Tue, Nov 22, 2016 at 12:42 PM, Francesco Mari
> <mari.france...@gmail.com> wrote:
>> Are those from oak-tarmk-standby or oak-segment-tar?
>>
>> 2016-11-22 6:11 GMT+01:00 Chetan Mehrotra <chetan.mehro...@gmail.com>:
>>> Hi Team,
>>>
>>> Since the last 4-6 builds I am seeing a recurring failure of a few tests in
>>> the standby module
>>>
>>> * FailoverIPRangeIT
>>> * ExternalPrivateStoreIT
>>> * StandbyTestIT
>>>
>>> Probably something to be looked into
>>>
>>> Chetan Mehrotra


Re: oak-lucene shaded

2016-11-25 Thread Chetan Mehrotra
On Fri, Nov 25, 2016 at 2:33 PM, Torgeir Veimo <torgeir.ve...@gmail.com> wrote:
> I wasn't suggesting oak should adopt this approach at this time; it's
> merely a solution for those that need to combine oak with other code
> (usually elasticsearch) in a non-osgi environment (usually spring).

Okies ... I thought you wanted Oak to adopt the approach, hence the
confusion! Yes, the approach should work fine for such cases. Maybe we
can later add support for producing such an oak-lucene jar.

Chetan Mehrotra

