[ANNOUNCE] Java 8 is needed for Jackrabbit Oak 1.4 and 1.6 in the future

2020-12-02 Thread Thomas Mueller
Dear users of Apache Jackrabbit Oak,

Java 8 will be needed for future versions of Jackrabbit Oak 1.4 and 1.6, 
starting with Apache Jackrabbit Oak 1.4.27 and 1.6.21.

For details, see https://issues.apache.org/jira/browse/OAK-9294

This change is needed in order to allow upgrading to Apache Solr 8.6.3.

Regards,
Thomas



Intend to backport OAK-9184 to 1.8, 1.22

2020-08-28 Thread Thomas Mueller
Hi,

I intend to backport the fix for OAK-9184 to the 1.8 and 1.22 branches. The 
risk should be very limited.

Let me know if you have any concerns.

Regards,
Thomas

https://issues.apache.org/jira/browse/OAK-9184





Re: Intend to backport OAK-9065 to 1.8, 1.10, and 1.22

2020-06-08 Thread Thomas Mueller
Hi,

> the 1.10 branch was retired

Thanks! You are right, I won't backport to 1.10 then.

Regards,
Thomas




On 05.06.20, 17:49, "Marcel Reutegger"  wrote:

Hi,

I don't have general concerns with the backport, but please note the 1.10 
branch was retired on April 6th. See also 
https://jackrabbit.apache.org/oak/docs/roadmap.html
We shouldn't do any backports to that branch anymore.

Regards
 Marcel

On 05.06.20, 17:30, "Thomas Mueller"  wrote:


Hi,

I intend to backport the fix for OAK-9065 to the 1.8, 1.10, and 1.22 
branches. The risk should be very limited.

Let me know if you have any concerns.

Regards,
Thomas

https://issues.apache.org/jira/browse/OAK-9065





Intend to backport OAK-9065 to 1.8, 1.10, and 1.22

2020-06-05 Thread Thomas Mueller

Hi,

I intend to backport the fix for OAK-9065 to the 1.8, 1.10, and 1.22 branches. 
The risk should be very limited.

Let me know if you have any concerns.

Regards,
Thomas

https://issues.apache.org/jira/browse/OAK-9065



Re: Large string properties in repository

2019-09-19 Thread Thomas Mueller
Hi,

We are not sure yet if the property was 600 million characters long. It might 
have been only 1 million. Sure, we need to investigate this and log an issue 
about this. But we need a generic solution.

I think we should log a warning for strings larger than 100'000 characters. At 
some point we could maybe throw an exception, for example at 1 million 
characters by default (configurable).

Regards,
Thomas


On 16.09.19, 12:35, "Julian Reschke"  wrote:

On 16.09.2019 06:30, Mohit Kataria wrote:
> Hi Everyone,
>
> Recently I faced issues where a repository having large string
> properties (async indexed) led to high heap usage. So I am adding a warn
> log if we try to add large string properties. But I am not sure what should
> be the default max length for such string properties. From your past
> experience please suggest a default value for such large string properties.
> ...

How large was it, and what exactly did happen? Maybe this deserves a bug
report.

Best regards, Julian




Re: Query options in queries with OR

2019-03-22 Thread Thomas Mueller
Hi,

Yes, this definitely looks like a bug... Could you file a Jira issue please?

Regards,
Thomas


On 22.03.19, 13:12, "Vikas Saurabh"  wrote:

That sounds like a bug to me. Would love to hear Thomas Mueller's thoughts
too though.

--Vikas
(sent from mobile)

On Fri 22 Mar, 2019, 17:26 Piotr Tajduś,  wrote:

> Hi,
>
> Not sure if this is a bug, but when query with OR is divided into union
> of queries, options (like index tag) are not passed into subqueries. I
> have fixed it in my copy of sources in
> org.apache.jackrabbit.oak.query.QueryImpl.copyOf() by copying options too.
>
>
> Best regards,
>
> Piotr Tajduś
>
>
>




Re: [Initially posted to users@j.a.o] Problem with read limits & a query using a lucene index with many results (but below setting queryLimitReads)

2019-02-13 Thread Thomas Mueller
Hi,

> Wouldn't it make sense to introduce a query option ala [1] to disable 
> read/memory limits for one particular query?

It's possible, but my fear is that people would use the option in their queries 
too often...

> OAK-6875 does not always have the desired effect (for sure there is some 
> non-deterministic behaviour for large content being accessed

Yes, I have seen cases where an index is re-opened during query execution. In 
that case, already returned entries are read again and skipped, so basically 
counted twice. I think it would be good to fix this (only count entries once).

I think queries should read at most a few thousand entries. That way, there 
are no problems if the limit is set to 100'000. If an application needs to read 
more than that, it is best to run multiple queries, using keyset pagination if 
needed:

* https://blog.jooq.org/tag/keyset-pagination/
* https://use-the-index-luke.com/no-offset
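As a hedged sketch, keyset pagination with JCR-SQL2 could look roughly like 
the following. The node type [myapp:record] and the indexed, unique property 
[myapp:id] are made-up names; the point is only the pattern of a small limit 
per query combined with a "greater than the last seen key" restriction:

import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;

void processAll(Session session) throws Exception {
    QueryManager qm = session.getWorkspace().getQueryManager();
    String lastId = "";
    while (true) {
        Query q = qm.createQuery(
                "SELECT * FROM [myapp:record] AS r"
                        + " WHERE r.[myapp:id] > $lastId"
                        + " ORDER BY r.[myapp:id]",
                Query.JCR_SQL2);
        q.bindValue("lastId", session.getValueFactory().createValue(lastId));
        q.setLimit(1000); // read at most a few thousand entries per query
        NodeIterator it = q.execute().getNodes();
        if (!it.hasNext()) {
            return; // no more pages
        }
        while (it.hasNext()) {
            Node n = it.nextNode();
            lastId = n.getProperty("myapp:id").getString();
            // ... process the node ...
        }
    }
}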

Regards,
Thomas
 



Re: Decide if a composite node store setup expose multiple checkpoint mbeans

2018-07-13 Thread Thomas Mueller
Hi,

I think we should discuss this. Right now, we use some of the beans like 
"global static" singletons. This might be a mistake, but that is how it is. Now, by 
introducing a second bean, this "contract" breaks. It's kind of like breaking 
backward compatibility...

Regards,
Thomas



On 09.07.18, 10:36, "Tomek Rękawek"  wrote:

Hello Vikas,

I think there was a similar case, described in OAK-5309 (multiple instances 
of the RevisionGCMBean). We introduced an extra property there - “role” - which 
can be used to differentiate the mbeans. It’s similar to the option 2 in your 
email. The empty role means that the mbean is related to the “main” node store, 
while non-empty one is only used for the partial node stores, gathered together 
by CNS. Maybe we can use similar approach here?

Regards,
Tomek

-- 
Tomek Rękawek | ASF committer | www.apache.org
tom...@apache.org

> On 5 Jul 2018, at 23:59, Vikas Saurabh  wrote:
> 
> Hi,
> 
> We recently discovered OAK-7610 [0] where
> ActiveDeletedBlobCollectorMBeanImpl got confused due to multiple
> implementations of CheckpointMBean being exposed in composite node
> store setups (since OAK-6315 [1] which implemented checkpoint bean for
> composite node store)
> 
> While, for the time being, we are going to avoid that confusion by
> changing ActiveDeletedBlobCollectorMBeanImpl to keep on returning
> oldest checkpoint timestamp if all CheckpointMBean implementations
> report the same oldest checkpoint timestamp. But that "work-around"
> works currently because composite node store uses global node store to
> list checkpoint to get oldest timestamp... but the approach is
> incorrect in general as there's no such guarantee.
> 
> So, here's the question for the discussion: how should the situation
> be handled correctly. Afaict, there are a few options (in decreasing
> order of my preference):
> 1. there's only a single checkpoint mbean exposed (that implies that
> mounted node store services need to "know" that they are mounted
> stores and hence shouldn't expose their own bean)
> 2. composite node store's checkpointMBean implementation can expose
> some metadata (say implement a marker interface) - discovering such
> implementation can mean "use this implementation for repository level
> functionality"
> 3. keep the work-around to be implemented in OAK-7610 [0] but document
> (ensure??) that the assumption that "all implementations would have
> same oldest checkpoint timestamp"
> 
> Would love to get some feedback.
> 
> [0]: https://issues.apache.org/jira/browse/OAK-7610
> [1]: https://issues.apache.org/jira/browse/OAK-7315
> 
> 
> Thanks,
> Vikas





Intent to backport OAK-7437 - SimpleExcerptProvider highlighting should be case insensitive

2018-04-26 Thread Thomas Mueller
Hi

I would like to backport https://issues.apache.org/jira/browse/OAK-7437.

Please let me know if you have any concern/objection.

Regards,
Thomas
 




Re: oak-search module

2018-04-04 Thread Thomas Mueller
+1

On 04.04.18, 10:23, "Tommaso Teofili"  wrote:

Hi all,

In the context of creating an (abstract) implementation for Oak full text
indexes [1], I'd like to create a new module called _oak-search_.
Such module will contain:
- implementation agnostic utilities for full text search (e.g. aggregation
utilities)
- implementation agnostic SPIs to be extended by implementors (currently we
expose SPIs in oak-lucene whose signatures include Lucene specific APIs)
- abstract full text editor / query index implementations
- text extraction utilities

Please share your feedback / opinions / concerns.

Regards,
Tommaso

[1] : https://issues.apache.org/jira/browse/OAK-3336




Intent to backport OAK-7131 (xpath to sql2 conversion drops order by clause for some cases)

2018-01-17 Thread Thomas Mueller
I want to backport OAK-7152 to all maintenance branches. The fix is simple and 
low risk.

Regards,
Thomas




Re: Intent to backport OAK-7152

2018-01-17 Thread Thomas Mueller
+1

On 15.01.18, 09:47, "Marcel Reutegger"  wrote:

Hi,

I will backport OAK-7152 to all maintenance branches. The fix is trivial 
and very low risk because the method currently simply does not return.

Regards
 Marcel





Re: Consider making Oak 1.8 an Oak 2.0

2017-12-06 Thread Thomas Mueller
Hi,

> Upgrading lucene to version 6 would probably warrant using 2.0, but that's 
> not ready yet for 1.8?

No, it's not yet ready for 1.8.

Regards,
Thomas
 



Re: Consider making Oak 1.8 an Oak 2.0

2017-12-06 Thread Thomas Mueller
I vote for 1.8. I don't see any big changes that would justify version 2.0. The 
modularization (moving code around) is an ongoing process; I don't think this is 
"fixed", and it shouldn't have a big impact on users.



Re: [VOTE] Release Apache Jackrabbit Oak 1.6.4

2017-08-16 Thread Thomas Mueller
> Please vote on releasing this package as Apache Jackrabbit Oak 1.6.4.
+1  

Thomas





Re: [CompositeBlobStore] Delegate traversal algorithm

2017-08-16 Thread Thomas Mueller
Hi,

The Bloom filter is something to consider, to speed up reading. It's not 
strictly needed of course.

Yes, I would consider using Bloom filters to more quickly find out where an 
entry is stored, if there are multiple possibilities. So, one filter per 
"delegate". In our case, the most logical place to do that is for the read-only 
stores. They could also be used for read-write stores (created during garbage 
collection for example). Sure, they would not always be up-to-date, but most 
(let's say 90%) binaries are older than the last GC, so it would speed up that 
case (and have basically no cost for new entries, as the filter is in memory).
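A minimal sketch of one such per-delegate filter, using Guava's BloomFilter 
(the wiring into the composite store itself is hypothetical):

import java.nio.charset.StandardCharsets;
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

class DelegateBlobFilter {

    // sized for an assumed number of blobs and false-positive rate;
    // a false positive only costs one extra read attempt
    private final BloomFilter<String> ids = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8),
            10_000_000, // expected insertions
            0.01);      // false-positive probability

    void recordBlobId(String blobId) {
        ids.put(blobId);
    }

    boolean mightContain(String blobId) {
        return ids.mightContain(blobId);
    }
}

A read would then only be attempted against delegates whose filter returns 
true for the blob id.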

Regards,
Thomas





On 16.08.17, 02:25, "Matt Ryan" <o...@mvryan.org> wrote:

Hi Thomas (and everyone else):

I wanted to ask about a comment you made in the wiki where you said "Bloom
filters should be mentioned (are they used, if yes how, if not why not).”
 I assume since you included that you are thinking they probably should be
used.

I believe the intended use of a Bloom filter in this case would be for read
operations, to quickly determine if a blob id is not stored anywhere in the
system.  Let me know if you had another use in mind.

If that’s the use case, I wonder how we would reasonably come up with a
useful guess as to the appropriate size of the filter.  Someone with more
experience using them could maybe offer some insight here as to appropriate
values for an expected number of insertions and the appropriate expected
false positive probability.

It seems like we could also use more than one Bloom filter, one for each
delegate to say whether the blob id is located in that particular
delegate.  Not sure if you were thinking more along those lines or just a
single Bloom filter for the entire composite as a whole, or both.

-MR

On August 15, 2017 at 4:06:56 PM, Matt Ryan (o...@mvryan.org) wrote:

Hi Thomas,

After emailing I saw you also provided comments in-line on the wiki.  I’ll
work through those and reply back on-list when I think I have addressed
them.  Thanks for doing that also!

-MR


On August 15, 2017 at 2:01:04 PM, Matt Ryan (o...@mvryan.org) wrote:

Hi Thomas,

Thank you for taking the time to offer a review.  I’ve been going through
the suggested readings and will continue to do so.

Some comments inline below.


On August 15, 2017 at 12:25:54 AM, Thomas Mueller 
(muel...@adobe.com.invalid)
wrote:

Hi,

It is important to understand which operations are available in the JCR
API, the DataStore API, and the concept of revisions we use for Oak. For
example,

* The DataStore API doesn’t support updating a binary.


This is of course true.  The interface supports only an “addRecord()”
capability to put a blob into the data store.  The javadoc there clearly
expects the possibility that the record may already exist:  "If the same
stream already exists in another record, then that record is returned
instead of creating a new one.”

Implementations handle the details of what happens when the blob already
exists.  For example, the “write()” method in the S3Backend class clearly
distinguishes between the two as the way to handle this via the AWS SDK is
different for an update versus a create:

https://svn.apache.org/repos/asf/jackrabbit/oak/trunk/oak-blob-cloud/src/main/java/org/apache/jackrabbit/oak/blob/cloud/s3/S3Backend.java

It is still the case that from the data store’s point of view there is no
difference between the two so it doesn’t support a distinction.

The original data store concept can take this approach because it only has
one place for the data to go.  The composite blob store has more than one
place the data could go, so I believe there is a possibility that the data
could exist in a delegate blob store that is not the first blob store that
the data could be written to.

What should happen in that case?  I assumed we should try to find a match
first, and prefer updating to creating new.  I’m not sure exactly how that
would happen though, since the name only matches if the content hash is the
same (unless there’s a collision of course), and otherwise it’s a new blob
anyway.



* A node might have multiple revisions.
* In the Oak revision model, you can't update a reference of an old
revision.


Does the data store even know about this?  I assumed this was all handled
at a higher level, and that once the data store is told to add a record
it’s already been determined that the write is okay, even if it ends up
that the stream being written already exists somewhere.



* The JCR API allows creating binaries without nodes via ValueFactory (so it's 
not possible to use storage filters at that time).

Re: [CompositeBlobStore] Delegate traversal algorithm

2017-08-15 Thread Thomas Mueller
Hi,

It is important to understand which operations are available in the JCR API, 
the DataStore API, and the concept of revisions we use for Oak. For example, 

* The DataStore API doesn’t support updating a binary.
* A node might have multiple revisions.
* In the Oak revision model, you can't update a reference of an old revision.
* The JCR API allows creating binaries without nodes via ValueFactory (so it's 
not possible to use storage filters at that time).

What you didn't address is how to read if there are multiple possible storage 
locations, so I assume you didn't think about that case. In my view, this 
should be supported. You might want to read up on LSM trees on how to do that: 
using bloom filters for example.

Suggested readings:
* https://docs.adobe.com/content/docs/en/spec/jsr170/javadocs/jcr-2.0/index.html
* https://docs.adobe.com/content/docs/en/spec/jcr/1.0/index.html
* https://en.wikipedia.org/wiki/Content-addressable_storage
* https://en.wikipedia.org/wiki/Log-structured_merge-tree

Regards,
Thomas



On 15.08.17, 08:00, "Thomas Mueller" <muel...@adobe.com> wrote:

Hi,

I read your wiki update, and this caught my eye:

>  If a match is found, the write is treated as an update; if no match is 
> found, the write is treated as a create.

In the DataStore, there is no such thing as an update. There are only the 
following operations:

* write
* read
* delete, via garbage collection

See also https://en.wikipedia.org/wiki/Content-addressable_storage

Regards,
Thomas


On 14.08.17, 17:17, "Matt Ryan" <o...@mvryan.org> wrote:

Bump.  If anyone has feedback I’d love to hear it.


On August 3, 2017 at 6:27:39 PM, Matt Ryan (o...@mvryan.org) wrote:

Hi,

I’ve been thinking the past few days about how a composite blob store 
might
go about prioritizing the delegate blob stores for reading and writing,
considering concepts like storage filters on a blob store, read-only 
blob
stores, and archive or “cold” blob stores (which we don’t currently 
have,
but could in the future).

Storage filters basically restrict what can be stored in a delegate - 
like
saying only blobs with a certain JCR property, etc.  (I realize there 
are
implications with this too - I’ll worry about that in a separate thread
someday.)

I’d like feedback on the following idea:
- Create a new public interface in Oak that can be injected into the
composite blob store and used to handle the delegate prioritization for
reads and writes.
- Create a default implementation of this interface that can be used in
most cases (see below).

This would allow extensibility in this area to implement new or more 
custom
algorithms for any future use cases, as needed, without tying it to
configuration.

The default implementation would be basically this:
- For reads:
  - Delegates with storage filters first
  - Delegates without storage filters next
  - Read-only delegates next (with filters first, then without)
  - Retry reads on delegates with filters that were previously skipped
(this is a special case)
  - Cold storage delegates last

- For writes:
  - Search for an existing blob first using the “read” algorithm - 
always
update an existing blob, if one is found (except in cold storage)
  - If not found:
- Try delegates with storage filters first
- Delegates without storage filters next

The special case to retry reads on delegates with filters that were
previously skipped is to handle configuration change.  Essentially, if a
blob is stored in a delegate blob store, and then the configuration for
that delegate changes so that the blob wouldn’t be stored there if it 
was
being written now, we want to be able to locate it during the time 
between
when the configuration change happens and some background curator moves 
the
blob to the correct location.


So in short, I’d do the default implementation as described, but a
different implementation could be injected instead, if someone wanted a
more custom one.


WDYT?


-MR






Re: [CompositeBlobStore] Delegate traversal algorithm

2017-08-15 Thread Thomas Mueller
Hi,

I read your wiki update, and this caught my eye:

>  If a match is found, the write is treated as an update; if no match is 
> found, the write is treated as a create.

In the DataStore, there is no such thing as an update. There are only the 
following operations:

* write
* read
* delete, via garbage collection

See also https://en.wikipedia.org/wiki/Content-addressable_storage

Regards,
Thomas


On 14.08.17, 17:17, "Matt Ryan"  wrote:

Bump.  If anyone has feedback I’d love to hear it.


On August 3, 2017 at 6:27:39 PM, Matt Ryan (o...@mvryan.org) wrote:

Hi,

I’ve been thinking the past few days about how a composite blob store might
go about prioritizing the delegate blob stores for reading and writing,
considering concepts like storage filters on a blob store, read-only blob
stores, and archive or “cold” blob stores (which we don’t currently have,
but could in the future).

Storage filters basically restrict what can be stored in a delegate - like
saying only blobs with a certain JCR property, etc.  (I realize there are
implications with this too - I’ll worry about that in a separate thread
someday.)

I’d like feedback on the following idea:
- Create a new public interface in Oak that can be injected into the
composite blob store and used to handle the delegate prioritization for
reads and writes.
- Create a default implementation of this interface that can be used in
most cases (see below).

This would allow extensibility in this area to implement new or more custom
algorithms for any future use cases, as needed, without tying it to
configuration.

The default implementation would be basically this:
- For reads:
  - Delegates with storage filters first
  - Delegates without storage filters next
  - Read-only delegates next (with filters first, then without)
  - Retry reads on delegates with filters that were previously skipped
(this is a special case)
  - Cold storage delegates last

- For writes:
  - Search for an existing blob first using the “read” algorithm - always
update an existing blob, if one is found (except in cold storage)
  - If not found:
- Try delegates with storage filters first
- Delegates without storage filters next

The special case to retry reads on delegates with filters that were
previously skipped is to handle configuration change.  Essentially, if a
blob is stored in a delegate blob store, and then the configuration for
that delegate changes so that the blob wouldn’t be stored there if it was
being written now, we want to be able to locate it during the time between
when the configuration change happens and some background curator moves the
blob to the correct location.


So in short, I’d do the default implementation as described, but a
different implementation could be injected instead, if someone wanted a
more custom one.


WDYT?


-MR




Re: Intent to backport OAK-5899

2017-07-21 Thread Thomas Mueller
+1

On 12.07.17, 06:29, "Chetan Mehrotra"  wrote:

OAK-5899



Re: Intent to backport to 1.6: OAK-5827

2017-07-21 Thread Thomas Mueller
+1


On 13.07.17, 13:22, "Julian Reschke"  wrote:

https://issues.apache.org/jira/browse/OAK-5827

"Don't use SHA-1 for new DataStore binaries"

(security related and in trunk since March)

Best regards, Julian




Intent to backport OAK-6359 (Change behavior for very complex queries) to older versions

2017-07-21 Thread Thomas Mueller
Hi,

OAK-6359 prevents complex queries from using 100% CPU (in an almost 
endless loop) or eventually running out of memory. The fix is simple and already 
tested in trunk. There is a feature flag that allows switching to the old 
behaviour.

Regards,
Thomas



Re: [DiSCUSS] - highly vs rarely used data

2017-07-11 Thread Thomas Mueller
Hi,

On 10.07.17, 11:18, "Bertrand Delacretaz"  wrote:
> Throw an exception maybe? BinaryNotAvailableAtThisTime, including an
> ETA for availability. The application can then decide how to handle
>that.

Bertrand, this is exactly what I have suggested in two previous mails:

My concrete suggestion would be, as I wrote: if it's in cold storage, throw an 
exception saying so, and load the binary into hot storage. A few minutes later, 
re-reading will not throw an exception as it's in hot storage. So, there is no 
API change needed, except for a new exception class (subclass of 
RepositoryException). An application can catch those exceptions and deal with 
them in a special way (write that the binary is not currently available). 
Possibly the new exception could have a method "doNotMoveBinary()" in case 
moving is not needed, but by default the binary should be moved, so that old 
applications don't have to be changed at all (backward compatibility).
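A minimal sketch of such an exception class (the class name and the 
doNotMoveBinary() method are only illustrative, not an existing Oak API):

import javax.jcr.RepositoryException;

public class BinaryInColdStorageException extends RepositoryException {

    // by default, hitting this exception schedules the binary to be
    // moved back to hot storage
    private boolean moveBinary = true;

    public BinaryInColdStorageException(String message) {
        super(message);
    }

    // applications that only want to report "not available right now"
    // can opt out of the move
    public void doNotMoveBinary() {
        moveBinary = false;
    }

    public boolean shouldMoveBinary() {
        return moveBinary;
    }
}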

Regards,
Thomas
 



Re: [DiSCUSS] - highly vs rarely used data

2017-07-05 Thread Thomas Mueller
Hi,

> (a) the implementation of an automatism is not *quite* what they need/want
> (b) they want to be able to manually select (or more likely override)
whether a file can be archived

Well, behind the scenes we need a way to move entries to / from cold storage 
anyway. But in my view, that's a low-level API, and I wouldn't expose it first, 
but instead concentrate on implementing an automatic solution, that has no API 
(except for some config options). If it later turns out the low-level API is 
needed, it can still be added. I wouldn't introduce that as public API right 
from the start, just because we _think_ it _might_ be needed at some point 
later. Because having to maintain the API is expensive.

What I would introduce right from the start is a way to measure which binaries 
were read recently, and how frequently. But even for that, there is no public 
API needed first (except for maybe logging some statistics).

> Thus I suggest to come up with a pluggable "strategy" interface

That is too abstract for me. I think it is very important to have a concrete 
behaviour and API, otherwise discussing it is not possible.

> A much more important and difficult question to answer IMHO is how to deal 
> with the slow retrieval of archived content.

My concrete suggestion would be, as I wrote: if it's in cold storage, throw an 
exception saying so, and load the binary into hot storage. A few minutes later, 
re-reading will not throw an exception as it's in hot storage. So, there is no 
API change needed, except for a new exception class (subclass of 
RepositoryException). An application can catch those exceptions and deal with 
them in a special way (write that the binary is not currently available). 
Possibly the new exception could have a method "doNotMoveBinary()" in case 
moving is not needed, but by default the binary should be moved, so that old 
applications don't have to be changed at all (backward compatibility).

What is your concrete suggestion?

Regards,
Thomas 



Re: [VOTE] Release Apache Jackrabbit Oak 1.7.3

2017-07-04 Thread Thomas Mueller
+1 Release this package as Apache Jackrabbit Oak 1.7.3
 



Re: [DiSCUSS] - highly vs rarely used data

2017-07-03 Thread Thomas Mueller
Hi,

> a property on the node, e.g. "archiveState=toArchive"

I wonder if we _can_ easily write to the version store? Also, some nodetypes 
don't allow such properties? It might need to be a hidden property, but then 
you can't use the JCR API. Or maintain this data in a "shadow" structure (not 
with the nodes), which would complicate move operations.

If I were a customer, I wouldn't want to *manually* mark / unmark binaries to 
be moved to / from long-term storage. I would probably just want to rely on 
automatic management. But I'm not a customer, so my opinion is not that 
relevant.

> Using a property directly specified for this purpose gives us more direct 
> control over how it is being used I think.

Sure, but it also comes with some complexities.

Regards,
Thomas





Re: [DiSCUSS] - highly vs rarely used data

2017-06-30 Thread Thomas Mueller
> From my perspective as an Oak user I would like to have control on that.
> It would be nice for Oak to make *suggestions* about moving things to
> cold storage, but there might be application constraints that need to
> be accounted for.

That sounds reasonable. What would be the "API" for this? Let's say the API is: 
configure a path that _allows_ binaries to be migrated to cold storage. It's 
not allowed for all other paths. The default configuration could be: allow for 
/jcr:system/jcr:versionStorage, don't allow anywhere else. This could be 
implemented using automatic moving (as I have described), _plus_ a background 
job that, twice a month, traverses all nodes and reads the first few bytes of 
all nodes that are _not_ in /jcr:system/jcr:versionStorage. The traversal could 
additionally do some reporting, for example how many binaries there are, how 
many times they were read, and how much money could be saved with a given 
configuration.

For automatic moving, behaviour could be:

- To move to cold storage: configuration would be needed: size, access 
frequency, recency (e.g. only move binaries larger than 1 MB that were not 
accessed for one month, and that were accessed only once in the month before 
that).

- When trying to access a binary that is in cold storage: you get an exception 
saying the binary is in cold storage. Plus, if configured, the binary would 
automatically be read from cold storage, so it's available within x minutes 
(configurable) when re-read.

- Bulk copy from cold storage to regular storage: This might be needed to 
create a full backup. We might need an API for this. 

Regards,
Thomas



Re: [DiSCUSS] - highly vs rarely used data

2017-06-30 Thread Thomas Mueller
Hi,

I guess you are talking about Amazon Glacier. Did you know about "Expedited 
retrievals" by the way? 
https://aws.amazon.com/about-aws/whats-new/2016/11/access-your-amazon-glacier-data-in-minutes-with-new-retrieval-options/
 - it looks like it's more than just "slow" + "fast".

About deciding which binaries to move to the slow storage: It would be good if 
that's automatic. Couldn't that be based on access frequency + recency? If a 
binary is not accessed for some time, it is moved to slow storage. I would add: 
if it was not accessed for some time, _plus_ it was rarely accessed before. 
Reason: for caching, it is well known that not only the recency, but also 
frequency, are important to predict if an entry will be needed in the near 
future. To do that, we could maintain a log that tells you when, and how many 
times, a binary was read. Maybe Amazon / Azure keep some info about that, but 
let's assume not (or not in such a way we want or can use). 

For example, each client appends the blob ids that it reads to a file. Multiple 
such files could be merged. To save space for such files (probably not needed, 
but who knows):

* Use a cache to avoid repeatedly writing the same id, in case it's accessed 
multiple times.
* Maybe you don't care about smallish binaries (smaller than 1 MB for example), 
or care less about them. So, for example only move files larger than 1 MB. That 
means no need to add an entry.
* A bloom filter or similar could be used (so you would retain x% too many 
entries). Or even simpler: only write the first x characters of the binary id. 
That way, we retain x% too much in fast storage, but save time, space, and 
memory for maintenance.

Regards,
Thomas


On 26.06.17, 18:10, "Matt Ryan"  wrote:

Hi,

With respect to Oak data stores, this is something I am hoping to support
later this year after the implementation of the CompositeDataStore (which
I'm still working on).

First, the assumption is that there would be a working CompositeDataStore
that can manage multiple data stores, and can select a data store for a
blob based on something like a JCR property (I'm still figuring this part
out).  In such a case, it would be possible to add a property to blobs that
can be archived, and then the CompositeDataStore could store them in a
different location - think AWS Glacier if there were a Glacier-compatible
data store.  Of course this would require that we also support an access
pattern in Oak where Oak knows that a blob can be retrieved but cannot
reply to a request with the requested blob immediately.  Instead Oak would
have to give a response indicating "I can get it, but it will take a while"
and suggest when it might be available.

That's just one example.  I believe once I figure out the
CompositeDataStore it will be able to support a lot of neat scenarios from
on the blob store side of things anyway.

-MR

On Mon, Jun 26, 2017 at 2:22 AM, Davide Giannella  wrote:

> On 26/06/2017 09:00, Michael Dürig wrote:
> >
> > I agree we should have a better look at access patterns, not only for
> > indexing. I recently came across a repository with about 65% of its
> > content in the version store. That content is pretty much archived and
> > never accessed. Yet it fragments the index and thus impacts general
> > access times.
>
> I may say something stupid as usual, but here I can see for example that
> such content could be "moved to a slower repository". So for example
> speaking of segment, it could be stored in a compressed segment (rather
> than plain tar) and the repository could either automatically configure
> the indexes to skip such part or/and additionally create an ad-hoc index
> which could async by definition every, let's say, 10 seconds.
>
> We would gain on the repository size and indexing speed.
>
> Just a couple of ideas off the top of my head.
>
> Davide
>
>
>




Re: Nodetype index

2017-06-30 Thread Thomas Mueller
Hi,

Right now, there is only one nodetype index. So, if you add a nodetype / mixin 
to that index (as you know, the list of nodetypes / mixins is a multi-valued 
property), then you need to reindex that index, which requires reading all the 
nodes.

The alternative would be to have multiple nodetype indexes. A patch for that is 
welcome! If you have that, then instead of changing the nodetype index, you 
create a new index. This also needs to read all the nodes.

So, that would be more convenient (even though in both cases indexing a new 
nodetype takes about the same time).

> Is this a design choice to only allow one nodetype index?

Not a design choice, just how it is implemented right now.

> I have no way to make it contained in the project itself to make an index 
> based on it's mixin type.

There is a way, you need some code to extend the existing nodetype index, and 
then reindex.

Regards,
Thomas



 On 29.06.17, 16:42, "Roy Teeuwen"  wrote:

Hey all,

Some time ago I asked about creating an oak index based on the node type 
(primary type or mixin type), after which I was pointed to the nodetype index. 
I have to say though that there is a serious drawback to this index:

I have two separate projects, both having some code based on a mixinType. 
But seeing as there can only be one nodetype index, I have no way to make it 
contained in the project itself to make an index based on it's mixin type.
Is this a design choice to only allow one nodetype index? Is there a 
workaround for this?

Thanks!
Roy




Intend to backport OAK-6391 - With FastQuerySize, getSize() returns -1 if there are exactly 21 rows

2017-06-29 Thread Thomas Mueller

Hi,

I'd like to backport OAK-6391 to the maintenance branches. The query result 
getSize() method is often used, and it is important that the result is as 
accurate as possible (even though the spec allows returning -1).

Regards,
Thomas






Re: backporting OAK-6317 until 1.2 branch

2017-06-08 Thread Thomas Mueller
+1

On 08.06.17, 11:29, "Tommaso Teofili"  wrote:

Hi all,

I'd like to backport the fix for a bug in LMSEstimator [1] (LMSEstimator is
used by oak-solr-core to estimate the no. of entries in the index without
issuing a query to Solr) until branch 1.2 (as it was observed on a 1.2.x
Oak instance).

Regards,
Tommaso

[1] : https://issues.apache.org/jira/browse/OAK-6317




Re: [VOTE] Release Apache Jackrabbit Oak 1.7.1

2017-06-07 Thread Thomas Mueller
Ah, same as Alex!


On 06.06.17, 18:06, "Alex Parvulescu"  wrote:

[X] +1 Release this package as Apache Jackrabbit Oak 1.7.1


I had a transient error on
'ActiveDeletedBlobCollectorTest.multiThreadedCommits:230' but it went away
on the second run.

alex


On Tue, Jun 6, 2017 at 2:27 PM, Davide Giannella  wrote:

> A candidate for the Jackrabbit Oak 1.7.1 release is available at:
>
> https://dist.apache.org/repos/dist/dev/jackrabbit/oak/1.7.1/
>
> The release candidate is a zip archive of the sources in:
>
>
> https://svn.apache.org/repos/asf/jackrabbit/oak/tags/jackrabbit-oak-1.7.1/
>
> The SHA1 checksum of the archive is
> 4109f37f1533b6aa23f667fbd8d0ef213e67d6aa.
>
> A staged Maven repository is available for review at:
>
> https://repository.apache.org/
>
> The command for running automated checks against this release candidate 
is:
>
> $ sh check-release.sh oak 1.7.1 4109f37f1533b6aa23f667fbd8d0ef
> 213e67d6aa
>
> Please vote on releasing this package as Apache Jackrabbit Oak 1.7.1.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Jackrabbit PMC votes are cast.
>
> [ ] +1 Release this package as Apache Jackrabbit Oak 1.7.1
> [ ] -1 Do not release this package because...
>
> Davide
>




Re: [VOTE] Release Apache Jackrabbit Oak 1.7.1

2017-06-07 Thread Thomas Mueller
FYI I got a test failure in oak-lucene, 
ActiveDeletedBlobCollectorTest.multiThreadedCommits, see comment in OAK-2808.

This is not a vote. I guess it just happens on my machine, so no need to 
block the release.




Re: Intent to backport to 1.4: OAK-5612

2017-05-31 Thread Thomas Mueller
+1 (fixing tests is always good)

On 31.05.17, 12:01, "Julian Reschke"  wrote:

https://issues.apache.org/jira/browse/OAK-5612

(test case improvement)




Re: Intent to backport to 1.6/1.4/1.2/1.0: OAK-5652

2017-04-27 Thread Thomas Mueller
+1

On 27.04.17, 11:29, "Julian Reschke"  wrote:

https://issues.apache.org/jira/browse/OAK-5652

(test dependency)




Intent to backport to 1.6: OAK-6116

2017-04-27 Thread Thomas Mueller
Severity:

* Affects generated queries (generated using a query builder tool).
* The workaround is to _not_ use path restrictions in the query, which slows 
down the query.
* Only affects new queries.
* Failure is during the parse phase.

The risk of fixing:

* The change is limited to "union" queries (a new feature in Oak 1.6)
* The change is small
* There are tests
* Already reviewed

Risk of not fixing:

* Breaks the new feature of XPath union queries.
* Having to switch to the "old-style" (restricting the path manually).

Regards,
Thomas





Re: New JIRA component for observation

2017-04-03 Thread Thomas Mueller
Hi,

> I would prefer to stay aligned with Maven boundaries as much as possible 
> as this simplifies bug reporting for parties not deeply involved with 
> Oak very much.

Actually, I don't think that's a problem. I wouldn't expect such a person to 
specify any module (logical or maven).

>  Most of the apparent need to break out of that scheme is 
> to me rather a symptom of missing modularity rather than a cure.

It sounds like "modularity" for you means "Maven modularity". I don't agree 
this is the only "modularity" there is.

> If we introduce logical modules in Jira, I strongly suggest to come up with a 
> clear and concise definition for them: what exactly belongs to them, 
> what not?

Yes. And I would like to understand the reason why we use modules (for 
reporting, easier to find issues,…?)

Regards,
Thomas




Re: New JIRA component for observation

2017-03-30 Thread Thomas Mueller
Hi,

I think the main question is, what do we use the Jira component for. Right now, 
I don't use it. Do we want to use it for statistics, or to be able to "monitor" 
or "group" issues? Depending on that, we can use "Maven" module 
boundaries, or "Logical" module boundaries. For example, we might want to add 
"OSGi" even though it's not a separate Maven module. "Observation", "Query", 
and "Security" are also in multiple Maven projects (at least jcr and core), no 
matter how we split.

Regards,
Thomas



On 29.03.17, 08:59, "Angela Schreiber"  wrote:

hi chetan

i don't really see the problem with the big amount of issues inside the
'core' module.
on a regular basis i look at unassigned issues and those without a
component to see if there is anything in there that i missed.

from a consumer point of view though i see a lot of benefit of having the
structure aligned with svn because you don't have to wonder where to put
stuff.

kind regards
angela



On 29/03/17 08:02, "Chetan Mehrotra"  wrote:

>Not sure if we should have a 1-1 mapping between JIRA Component and
>Module at svn level. We can create logical components and later align
>them as and when new modules are carved out. If required JIRA
>components can be merged and renamed easily.
>
>As said having specific JIRA component for some logical feature set in
>Oak allows better tracking and discovery of logged issues which is
>harder with current set where "core" component has lots of different
>types of issues clubbed together
>Chetan Mehrotra
>
>
>On Tue, Mar 28, 2017 at 12:32 PM, Angela Schreiber 
>wrote:
>> i agree with marcel.
>> in general i would rather move forward with the modularisation and then
>> adjust jira accordingly.
>>
>> kind regards
>> angela
>>
>> On 27/03/17 09:26, "Marcel Reutegger"  wrote:
>>
>>>Hi,
>>>
>>>I'm wondering if this is the best approach. Initially we used the JIRA
>>>component 1:1 for modules we have in SVN. Now we also use them for
>>>sub-modules like 'documentmk', 'mongomk', 'property-index', ...
>>>
>>>In my view this indicates that the existing modules should probably be
>>>split and we'd be back to a 1:1 relation between modules in SVN and
>>>components in JIRA. Alternatively, we could also use JIRA labels and
>>>group issues by features like observation.
>>>
>>>Regards
>>>  Marcel
>>>
>>>On 27/03/17 07:57, Chetan Mehrotra wrote:
 I analyzed the issues currently logged under component "core" which
 has ~100 issues. Looking at most issues I think we can do following

 1. Create a new component for observation issues i.e. "observation"
 2. Avoid marking same issue for multiple component like "documentmk
 and core" unless the change impacts code base outside of that
 component like in this case outside of documentmk package

 This would ensure that we can get some better sense out of issues
 currently clubbed under "core"

 Thoughts?

 Chetan Mehrotra

>>





Re: disabling nodetype index

2017-03-30 Thread Thomas Mueller
Hi,

Yes, it's safe to disable. Actually it's a good idea to disable, or at least 
change it so that only a few nodetypes are indexed (for example 
oak:QueryIndexDefinition nodes, or other config nodes).

Regards,
Thomas

On 29.03.17, 20:00, "Alex Benenson"  wrote:

Hi all

Is it safe to disable nodetype index? We do not have any custom queries
that needed it. Is it used for any oak internals?

Assuming it is safe to disable, can I follow that by deleting every node
under /oak:index/nodetype/:index ?

db.nodes.remove({_id: /^\d+:\/oak:index\/nodetype\/:index/})

(OAK 1.4, mongodb)


Thanks

-- 
Alex Benenson




Re: problem on oak jcr sql2 query

2017-03-24 Thread Thomas Mueller
Could you post the index definition please?


From: Ancona Francesco 
Reply-To: "oak-dev@jackrabbit.apache.org" 
Date: Thursday, 23 March 2017 at 15:19
To: "oak-dev@jackrabbit.apache.org" 
Cc: Diquigiovanni Simone 
Subject: problem on oak jcr sql2 query

Hi all,
we use SolrServer for fulltext searches, both on metadata and on binary 
content.
In general I have to find all nt:file nodes that contain the word “company”, or 
all nodes that have nt:resource child nodes that contain the same word.

Unfortunately, if I upload a file (so a node that is an nt:resource) and use this 
query
SELECT p.* FROM [nt:file] as p where contains(p.*, 'company')

Solr finds results, but the RowIterator doesn’t return anything.

In contrast, this query works
SELECT p.* FROM [nt:resource] as p where contains(p.*, 'company')
but it doesn’t find nt:file nodes.

Can you help me ?

Thanks in advance.


Francesco Ancona | Software Dev. Dept. (SP) - Software Architect
tel. +39 049 8979797 | fax +39 049 8978800 | cel. +39 3299060325
e-mail: francesco.anc...@siav.it | 
www.siav.it




Re: Supporting "resumable" operations on a large tree

2017-02-24 Thread Thomas Mueller
Hi,

>So we can implement a "paginated tree traversal"

Yes, I think that's a first step, something for oak-core which can be
re-used in multiple places. It might make sense to also create a JCR
version, for other use cases.

Regards,
Thomas



Re: SHA-1 collision

2017-02-24 Thread Thomas Mueller
Hi,

I created OAK-5827 to track this.

The problem is not just that there exist two files. I think it is a real
security vulnerability, because:

https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html


"we will wait 90 days before releasing code that allows anyone to create a
pair of PDFs that hash to the same SHA-1 sum given two distinct images
with some pre-conditions."

Regards,
Thomas




On 24/02/17 08:12, "Thomas Mueller" <muel...@adobe.com> wrote:

>Hi,
>
>A SHA-1 collision has been published:
>https://www.schneier.com/blog/archives/2017/02/sha-1_collision.html
>https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html
>
>Our FileDataStore and S3DataStore use SHA-1. For new binaries, we should
>use (for example) SHA-256.
>
>Right now, a content management system that uses Oak as the repository
>can't serve those two files at the same time, if it uses the
>FileDataStore or the S3DataStore.
>
>(The FileBlobStore, MongoDB BlobStore,..., are not affected)
>
>Regards,
>Thomas
>
>
>



Re: Supporting "resumable" operations on a large tree

2017-02-23 Thread Thomas Mueller
Hi,

My suggestion is to _not_ support "resumable" operations on a large tree,
but instead don't use large operations. But I wouldn't call my solution
"sharding", but more "bit-by-bit reindexing". Some more details: For
indexing (specially synchronous property indexes) I suggest to do the
following, for both a new index, and for reindexing:

1) Reindexing is writing to a new subtree, that is, ":index_1",
":index_2",... Which one is used is (automatically) set in the index
itself at /oak:index/<...>, in a hidden property ":writeNode"

2) Reading (queries) use the old subtree if there is any (":index" right
now). Which one it used is (automatically) set in the index itself at
/oak:index/<...>, in a hidden property ":liveNode"

3) Synchronous index updates are written to the index node defined at
":liveNode". Therefore, at the same time, reindex to a new subtree and
index updates to the old subtree can occur (actually more than that, see
below at 6).

4) To track reindexing/indexing progress, there is a "current position"
persisted at /oak:index/<...>, in a hidden property ":writePath". This
entry is automatically advanced by the asynchronous reindexing thread from
"/" to "/a" to "/b/a" to "/content/a/b", "/content/x", "/system/03/01",
"/system/04", "/system/0a",... and so on, until the whole repository is
indexed.

5) The asynchronous indexing thread reads all nodes of the repository
_after_ what is written at ":writePath" and indexes that to the new
subtree, and once there are 1000 changes to the index, the last indexed
path is written to ":writePath", plus the additions to the index, in one
transaction.

6) Synchronous index updates are also written to the index node defined at
":writeNode", if they affect nodes before the ":writePath".

7) After a restart, reindexing continues where it left off, using the last
":writePath".

8) After reindexing is done, ":liveNode" is updated to ":index_2", and
":writePath" is removed. Also, the old index subtree is removed (1000
nodes per commit).

9) Sorting of paths is needed, so that the repository can be processed bit
by bit. For that, the following logic is used, recursively: read at
most 1000 child nodes. If there are more than 1000, then this subtree is
never split but processed in one step (so many child nodes can still lead
to large transactions, unfortunately). If less than 1000 child nodes, then
the names of all child nodes are read, and processed in sorted order
(sorted by node name).

Therefore, all reindexing operations use small transactions. Queries can
run concurrently. Reindexing can be paused. Reindexing can continue even
after a restart. While reindexing, the old index is maintained and
up-to-date. The branch-less commit mode is not needed. No conflicts
between the synchronous index updates and asynchronous reindexing thread
are possible.
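To illustrate point 9, a hedged sketch of the sorted, bit-by-bit traversal
(the indexing callbacks are placeholders, not existing Oak code):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.jackrabbit.oak.spi.state.NodeState;

class BitByBitTraversal {

    void traverse(NodeState node, String path) {
        if (node.getChildNodeCount(1001) > 1000) {
            // too many children to split: process this subtree in one step
            indexSubtreeInOneStep(node, path);
            return;
        }
        List<String> names = new ArrayList<>();
        for (String name : node.getChildNodeNames()) {
            names.add(name);
        }
        Collections.sort(names); // process child nodes in sorted order
        for (String name : names) {
            String childPath = "/".equals(path) ? "/" + name : path + "/" + name;
            indexNode(node.getChildNode(name), childPath); // small commit, advance ":writePath"
            traverse(node.getChildNode(name), childPath);
        }
    }

    void indexSubtreeInOneStep(NodeState node, String path) {
        // placeholder: index the whole subtree in a single transaction
    }

    void indexNode(NodeState node, String path) {
        // placeholder: add this node to the new index subtree
    }
}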

Regards,
Thomas



SHA-1 collision

2017-02-23 Thread Thomas Mueller
Hi,

A SHA-1 collision has been published:
https://www.schneier.com/blog/archives/2017/02/sha-1_collision.html
https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html

Our FileDataStore and S3DataStore use SHA-1. For new binaries, we should use 
(for example) SHA-256.
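For illustration, hashing a binary with SHA-256 via the standard JDK API (how
the digest would be wired into the DataStore record identifiers is a separate
question):

import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;

class Sha256Example {

    static byte[] sha256(InputStream in) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (DigestInputStream dis = new DigestInputStream(in, digest)) {
            byte[] buffer = new byte[8192];
            // stream through the data; the digest is updated as a side effect
            while (dis.read(buffer) != -1) {
                // nothing to do
            }
        }
        return digest.digest();
    }
}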

Right now, a content management system that uses Oak as the repository can't 
serve those two files at the same time, if it uses the FileDataStore or the 
S3DataStore.

(The FileBlobStore, MongoDB BlobStore,..., are not affected)

Regards,
Thomas





Re: CommitEditors looking for specific child node like oak:index, rep:cugPolicy leads to lots of redundant remote calls

2017-02-23 Thread Thomas Mueller
Hi,

>I like Marcel proposal for "enforcing" use of mixin on parent node to
>indicate that it can have a child node of 'oak:index'. So we can
>leverage mxin 'mix:indexable' (OAK-3725) to mark such parent nodes
>(like root) and IndexUpdate would only look for 'oak:index' node if
>current node has that mixin.

Ah I didn't know about OAK-3725.

I'm a bit worried that we mix different aspects together, not sure which
is better.

"oak:Indexable" is visible, so it can be added and _removed_ by the user.
So when trying to remove that mixin, we would need to check there is no
oak:index child node with nodetype oak:QueryIndexDefinition. We need to
check the nodetype hierarchy. On the other hand, possibly we can enforce
that the parent node of oak:index is oak:Indexable (can we?)

I'm not saying that with a hidden property ":hasOakIndex"
(automatically set and removed) it would be painless. For example when
moving an oak:index node to a new parent, the setting has to be changed at
both the original and the new parents.

Regards,
Thomas




Re: CommitEditors looking for specific child node like oak:index, rep:cugPolicy leads to lots of redundant remote calls

2017-02-23 Thread Thomas Mueller
Hi,

For "oak:index" of type oak:QueryIndexDefinition, what about a hidden
property ":hasOakIndex" = true. That would be
NodeBuilder.hasProperty(":hasOakIndex").

Regards,
Thomas


On 22/02/17 12:57, "Chetan Mehrotra"  wrote:

>We have some CommitEditors in Oak which look for specific child node
>upon each commit like 'oak:index' and 'rep:cugPolicy'
>
>In most cases such child node does not exist and this leads to extra
>remote calls in case of DocumentNodeStore to determine if child with
>such a name exist or not. In case of updates to nodes where child data
>is not cached this quickly adds up and becomes a major portion of
>remote call made from Oak and something which we can avoid
>
>We should look into approaches where such child lookup can be avoided
>in critical write path.
>
>One possible approach is to mark the parent with a specific hidden
>property which has such a node upon addition. This would avoid the
>negative lookup in case of updates
>
>Chetan Mehrotra



Re: Flaky tests due to timing issues

2017-02-21 Thread Thomas Mueller
Hi,

>No I actually meant getting individual time-out values (or a scaling
>factor for time-outs) from CIHelper. That class already provides the
>means to skip tests based on where they are running. So it should be
>relatively straight forward to have it supply scaling factors for
>time-outs in a similar manner.

Do you have an example?

I think timeouts in the order of seconds are problematic, and I don't
think that "scaling" them to 5 seconds or so will fully solve the problem.
Timeouts in the order of minutes are better, but I wouldn't want to
_always_ delay tests that long. That's why I believe using loops is
better. But in that case, configuration seems unnecessary.

Regards,
Thomas



Re: Flaky tests due to timing issues

2017-02-21 Thread Thomas Mueller
Hi,

I assume with (b) you mean: change tests to use loops, combined with very
high timeouts. Example:

Before:

save();
Thread.sleep(1000);
assertTrue(abc());

After:

save();
for(int i=0; !abc() && i<600; i++) {
Thread.sleep(100);
}
assertTrue(abc());



The additional benefit of this logic is that on a fast machine, the test
is faster (only 100 ms sleep instead of 1 second). Disadvantage:
additional complexity, as you wrote (could be avoided with Java 8 lambda
expressions).
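A small helper along those lines could look like this (sketch only; the names
are illustrative):

import java.util.function.BooleanSupplier;

static boolean waitUntil(BooleanSupplier condition, long timeoutMillis)
        throws InterruptedException {
    long end = System.currentTimeMillis() + timeoutMillis;
    while (!condition.getAsBoolean()) {
        if (System.currentTimeMillis() > end) {
            return false;
        }
        Thread.sleep(100);
    }
    return true;
}

// usage in a test:
// save();
// assertTrue(waitUntil(() -> abc(), 60_000));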

Regards,
Thomas



On 21/02/17 13:49, "Michael Dürig"  wrote:

>
>Hi,
>
>I assume that at least some of the tests that sporadically fail on the
>Apache Jenkins fail because of timing issues. To address this we could
>either
>
>a) skip these tests on Jenkins,
>b) increase the time-out,
>c) apply platform dependent time-outs.
>
>
>I would prefer b). I presume that there is no impact on the build time
>unless the build fails anyway because it is running into one of these
>time-outs. If this is not acceptable we could go for b) and provision
>platform dependent time-outs through the CIHelpers class. I somewhat
>dislike the additional complexity though. As last resort we can still do
>a).
>
>WDYT?
>
>Michal



Re: Supporting "resumable" operations on a large tree

2017-02-21 Thread Thomas Mueller
Hi,

For re-indexing, there are two problems actually:

* Indexing can take multiple days, so resume would be nice
* For synchronous indexes, indexing creates a large commit, which is
problematic (specially for MongoDB)

To solve both problems ("kill two birds with one stone"), we could instead
try to split indexing into multiple commits. For example use a "fromPath"
.. "toPath" range, and only re-index part of the repository at a time. See
also 
https://issues.apache.org/jira/browse/OAK-5324?focusedCommentId=15837941&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15837941

Regards,
Thomas



On 20/02/17 13:13, "Chetan Mehrotra"  wrote:

>Hi Team,
>
>In Oak many a times we perform operations which traverse the tree as
>part of some processing. For e.g. commit hooks, side grade, indexing
>etc. For small tree this works fine and in case of failure the
>processing can be done again from start.
>
>However for large operations like reindexing whole repository for some
>index this posses a problem. For example consider a Mongo setup having
>100M+ nodes and we need to provision a new index. This would trigger
>an IndexUpdate which would go through all the nodes in the repository
>(in some depth first manner) and then build up the index. This process
>can take long time say 1-2 days for a Mongo based setup.
>
>As with any remote setup such a flow may get interrupted due to some
>network issue or outage on Mongo/RDB side. In such a case the whole
>traversal is started again from start.
>
>Same would be the case for any sidegrade operation where we convert a
>big repository from one form to another.
>
>To improve the resiliency of such operations (OAK-2063) we need a way
>to "resume" traversal in a tree from some last known point. For
>operations performed on a sorted list such a "resume" is easy but
>doing that over a tree traversal looks tricky.
>
>Thoughts on what approach can be taken for enabling this?
>
>May be if we can expect a stable order in traversal at a given
>revision then we can keep track of paths t certain depth and then on
>retry skip processing of subtrees untill untill we get that path
>
>Chetan Mehrotra



Re: [VOTE] Release Apache Jackrabbit Oak 1.6.0

2017-01-30 Thread Thomas Mueller
[X] +1 Release this package as Apache Jackrabbit Oak 1.6.0

Regards,
Thomas



Re: issues introducing non-reversible changes in the repository

2017-01-11 Thread Thomas Mueller
Hi,

I think within a major version of Oak (1.4.x, 1.6.x), there should be no
backward-incompatible data format changes.

If there are changes, then trying to start with an old version (1.2.x)
should fail. It might be possible to open the repository in read-only
mode; for that, then a "read" and a "write" version could be used, as this
is done in SQLite for example (https://www.sqlite.org/fileformat.html -
File Format Version Numbers). So that it's possible to open a repository
in read-only mode, if the read-version is the same. Not sure if it makes
sense to support the read-only mode in Oak.

Regards,
Thomas



On 11/01/17 11:26, "Tomek Rekawek"  wrote:

>Hi,
>
>Some of the Oak users are interested in rolling back the Oak upgrade
>within a branch (like 1.4.10 -> 1.4.1). As far as I understand, it should
>work, unless some of the commits in (1.4.1, 1.4.10] introduces a
>repository format change that is not compatible with the previous version
>(eg. modifies the format of a property in the DocumentMK).
>
>Right now there's no way to check this other than reviewing all the
>issues in the given version range related to the given components.
>
>Maybe it'd be useful to mark such issues with a label (like
>"breaks_compatibility", "non_reversible", "updates_schema", etc.)?
>
>WDYT? Which label should we choose and how we can make sure that it's
>really used in appropriate cases?
>
>Regards,
>Tomek
>
>-- 
>Tomek Rękawek | Adobe Research | www.adobe.com
>reka...@adobe.com
>



Re: RIP Apache Jenkins!?

2016-11-29 Thread Thomas Mueller
Hi,

>And option #4 - donate some computing capacity to run some dedicated
>Jenkins slaves for Oak.

I don't think it's a hardware problem. The problem seems to be turnaround
times from the Apache infra *team*: they seem to be overloaded. It's not
just with Jenkins, see for example:


https://issues.apache.org/jira/browse/INFRA-9709


This issue was created May 2015! With comments from Infra on June and
August 2015, and no activity since then, even after I have asked January
this year. Status: WAITING FOR INFRA

This is just crazy. Either issues get resolved, or they don't, in which
case we should get a notification that they don't.


Regards,
Thomas



Re: [VOTE] Release Apache Jackrabbit Oak 1.2.21

2016-11-15 Thread Thomas Mueller
[X] +1 Release this package as Apache Jackrabbit Oak 1.2.21




Re: segment-tar depending on oak-core

2016-10-25 Thread Thomas Mueller
Hi,

There are two "extreme" cases, and both are used and work fine (please
nobody says "it's a joke", and "monolithic" is worse):

* "Monolithic": Linux, Apache Lucene, and so on: one version for everything

* "Fine grained": Apache Sling: separate, independent versions for
everything

(actually I don't know more examples of "Fine grained")

Apache Sling doesn't really maintain multiple branches in the same way we
do in Oak. I argue that having to maintain multiple branches is easier
with the "monolithic" approach.



Re: segment-tar depending on oak-core

2016-10-21 Thread Thomas Mueller
>
>and using a different release
>cycle for oak-segment-tar is not a problem.

Sorry, I wanted to write "and using a different release cycle for
oak-segment-tar *created new problems*"



Re: segment-tar depending on oak-core

2016-10-21 Thread Thomas Mueller
Hi,

You are sure using many emotional, judgmental words and sentences like
"joke", "embarrassing", "nonsense", "We shouldn't go backward, but
forward", "pet project", "admit", "level of complexity", "doesn't allow",
"dumping grounds". Your whole mail is very judgmental.

OK, I see you would like to split everything into tiny, tiny modules.

Right now we already have many modules, and using a different release
cycle for oak-segment-tar is not a problem.

So your solution is to split things into even more modules. I see that, as
you seem to be very emotional about that.

But I don't agree that's the best solution. I prefer simple solutions,
that don't require a lot of bureaucracy and overhead.

I see no big value in "being able" to release things independently. In
fact I think it's added overhead, with no value.

Regards,
Thomas


On 21/10/16 15:09, "Francesco Mari" <mari.france...@gmail.com> wrote:

>Luckily for us this is not a computer science problem but an easier
>software engineering concern.
>
>The release process in Oak is a joke. Releasing every two weeks by
>using version numbers as counters just for the sake of it is
>embarrassing. I don't even know how many releases of our parent POM we
>have, every one of them equal to the other, and this is nonsense.
>
>We shouldn't go backward, but forward. We need to extract APIs into
>their own independently released bundles. We should split oak-run in
>different CLI utility modules, so that every implementation can take
>better care of their own utilities. Oak is not a pet project and we
>have to admit that its current level of complexity doesn't allow us to
>use oak-core and oak-run as dumping grounds anymore.
>
>2016-10-21 14:08 GMT+02:00 Thomas Mueller <muel...@adobe.com>:
>> Hi,
>>
>>> could adding an oak-core-api with independent lifecycle solve the
>>>situation?
>>
>> "All problems in computer science can be solved by another level of
>> indirection"
>>
>> I would prefer if we get oak-segment-tar in line with the rest of oak
>> (release it at the same time and so on). I understand, there are some
>> disadvantages. But I think all alternatives also have disadvantages.
>>
>> Regards,
>> Thomas
>>
>>
>>
>>
>> On 21/10/16 12:46, "Davide Giannella" <dav...@apache.org> wrote:
>>
>>>Hello team,
>>>
>>>while integrating Oak with segment-tar in other products, I'm facing
>>>quite a struggle with a sort-of circular dependencies. We have
>>>segment-tar that depends on oak-core and then we have tools like oak-run
>>>or oak-upgrade which depends on both oak-core and segment-tar.
>>>
>>>this may not be an issue but in case of changes in the API, like for
>>>1.5.12 we have the following situation. 1.5.12 has been released with
>>>segment-tar 0.0.14 but this mix doesn't actually work on OSGi
>>>environment as of API changes. On the other hand, in order to release
>>>0.0.16 we need oak-core 1.5.12 with the changes.
>>>
>>>Now oak-run and other tools may fail, or at least be in an unknown
>>>situation.
>>>
>>>All of this is my understanding and I may be wrong, so please correct me
>>>if I'm wrong. I'm right, could adding an oak-core-api with independent
>>>lifecycle solve the situation?
>>>
>>>Davide
>>>
>>>
>>



Re: segment-tar depending on oak-core

2016-10-21 Thread Thomas Mueller
Hi,

>The release process in Oak is a joke.

I don't think it's a joke.

> Releasing every two weeks by
>using version numbers as counters just for the sake of it is
>embarrassing.

Why? It's simple.

> I don't even know how many releases of our parent POM we
>have, every one of them equal to the other, and this is nonsense.

"Nonsense"... again a word without explanation.

>We shouldn't go backward, but forward.

It depends on what "backward is". I would prefer if we make things
"simpler".

> We need to extract APIs into
>their own independently released bundles.

I don't think we need to do that. The "release everything at once" sounds
good to me.

> We should split oak-run in
>different CLI utility modules

Split, split, and again split. Why? What is the advantage?

>, so that every implementation can take
>better care of their own utilities.

It's the Oak utilities. I think the current organization is just fine.

>Oak is not a pet project

Again, you are using strong words ("pet"), but without real explanation...
How is it that your definition of "pet" is the only valid one?

>and we
>have to admit that its current level of complexity doesn't allow us to
>use oak-core and oak-run as dumping grounds anymore.

Again a strong word... "dump".

I just don't see how making tiny "ravioli" modules makes things any
better. It surely makes things more complex, as we see with
oak-segment-tar: it forces to add even more and more modules, to be able
to deal with the consequences of adding modules.

Regards,
Thomas



Re: segment-tar depending on oak-core

2016-10-21 Thread Thomas Mueller
Hi,

> could adding an oak-core-api with independent lifecycle solve the
>situation?

"All problems in computer science can be solved by another level of
indirection"

I would prefer if we get oak-segment-tar in line with the rest of oak
(release it at the same time and so on). I understand, there are some
disadvantages. But I think all alternatives also have disadvantages.

Regards,
Thomas




On 21/10/16 12:46, "Davide Giannella"  wrote:

>Hello team,
>
>while integrating Oak with segment-tar in other products, I'm facing
>quite a struggle with a sort-of circular dependencies. We have
>segment-tar that depends on oak-core and then we have tools like oak-run
>or oak-upgrade which depends on both oak-core and segment-tar.
>
>this may not be an issue but in case of changes in the API, like for
>1.5.12 we have the following situation. 1.5.12 has been released with
>segment-tar 0.0.14 but this mix doesn't actually work on OSGi
>environment as of API changes. On the other hand, in order to release
>0.0.16 we need oak-core 1.5.12 with the changes.
>
>Now oak-run and other tools may fail, or at least be in an unknown
>situation.
>
>All of this is my understanding and I may be wrong, so please correct me
>if I'm wrong. I'm right, could adding an oak-core-api with independent
>lifecycle solve the situation?
>
>Davide
>
>



Re: On adding new APIs

2016-10-20 Thread Thomas Mueller
Hi,

I would prefer C-T-R (commit, then review), because it reduces
bureaucracy. Except for changes just before a major release (when there is
little time to undo or change things).


+1 to [REVIEW] emails. In my view, this should include new configuration
and new features. Basically all "important" user-visible behavior changes.
Sure, it's hard to say what is important and what is not.

Regards,
Thomas




On 20/10/16 14:21, "Bertrand Delacretaz"  wrote:

>On Thu, Oct 20, 2016 at 2:02 PM, Michael Marth  wrote:
>> ...So I have a proposal: when a new public API is added the developer
>>should drop an
>> email with subject tag [REVIEW] onto the dev list, so that others are
>>aware and can
>> chime in if needed...
>
>The Oak team can also just decide that new APIs are to be handled in
>R-T-C mode instead of C-T-R.
>
>It's kind of the same thing but using standard Apache terminology ;-)
>
>-Bertrand



Re: Oak query performance problem

2016-10-19 Thread Thomas Mueller
Hi,

There is the "nodetype" index (/oak:index/nodetype) which is normally used
for such queries. I would change that index.
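
For example, roughly like this (assuming the default index at
/oak:index/nodetype; note that setting declaringNodeTypes restricts the index
to exactly the listed node types, so include every type you still want covered):

import javax.jcr.Node;
import javax.jcr.PropertyType;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

public class NodetypeIndexChange {

    // Restrict the nodetype index to rep:ACL and trigger a reindex.
    public static void indexAclNodeType(Session session) throws RepositoryException {
        Node index = session.getNode("/oak:index/nodetype");
        index.setProperty("declaringNodeTypes",
                new String[] {"rep:ACL"}, PropertyType.NAME);
        index.setProperty("reindex", true);
        session.save();
    }
}

Alternatively, a separate property index on jcr:primaryType with
declaringNodeTypes set to rep:ACL works as well (see the quoted reply below).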

Regards,
Thomas


On 18/10/16 17:08, "Roy Teeuwen" <r...@teeuwen.be> wrote:

>Hey Thomas,
>
>Ok perfect, laying an oak:QueryIndexDefinition on propertyNames
>jcr:primaryType and declaringNodeTypes rep:ACL solved the issue.
>I presumed there already was an index for all the existing
>jcr:primaryTypes :), not that you have to specifically have them in the
>declaringNodeTypes
>
>Thanks!
>Roy
>> On 18 Oct 2016, at 16:53, Thomas Mueller <muel...@adobe.com> wrote:
>> 
>> Hi,
>> 
>>> I really don't see the reason why this could be such a hard query
>> 
>> 
>> Who said it's a hard query? :-)
>> 
>> Is the problem performance, or is the problem that you get an exception?
>> 
>> 
>> If the problem is performance, then you need an index on the node type
>> rep:ACL.
>> 
>> If the problem is the exception: In your case the query engine is
>> configured to fail the queries (the reason is written in the exception
>> message). You can change the limit using the JMX bean
>> "QueryEngineSettings". The default is Long.MAX_VALUE (virtually no
>>limit)
>> by the way.
>> 
>> Regards,
>> Thomas
>> 
>> 
>> 
>> On 18/10/16 16:44, "Roy Teeuwen" <r...@teeuwen.be> wrote:
>> 
>>> Hello all,
>>> 
>>> I got a problem in oak concerning query performance for the following
>>> simple queries
>>> 
>>> SELECT * FROM [rep:ACL] WHERE ISDESCENDANTNODE([/content])
>>> SELECT * FROM [rep:ACL] WHERE ISDESCENDANTNODE([/var])
>>> 
>>> I get the following exception:
>>> 
>>> java.lang.UnsupportedOperationException: The query read or traversed
>>>more
>>> than 15 nodes. To avoid affecting other tasks, processing was
>>>stopped.
>>> at 
>>> 
>>>org.apache.jackrabbit.oak.query.FilterIterators.checkReadLimit(FilterIte
>>>ra
>>> tors.java:66)
>>> at 
>>> 
>>>org.apache.jackrabbit.oak.spi.query.Cursors$TraversingCursor.fetchNext(C
>>>ur
>>> sors.java:324)
>>> at 
>>> 
>>>org.apache.jackrabbit.oak.spi.query.Cursors$TraversingCursor.next(Cursor
>>>s.
>>> java:303)
>>> at 
>>> 
>>>org.apache.jackrabbit.oak.query.ast.SelectorImpl.next(SelectorImpl.java:
>>>40
>>> 9)
>>> at 
>>> 
>>>org.apache.jackrabbit.oak.query.QueryImpl$RowIterator.fetchNext(QueryImp
>>>l.
>>> java:773)
>>> at 
>>> 
>>>org.apache.jackrabbit.oak.query.QueryImpl$RowIterator.hasNext(QueryImpl.
>>>ja
>>> va:798)
>>> at 
>>> 
>>>org.apache.jackrabbit.oak.jcr.query.QueryResultImpl$3.fetch(QueryResultI
>>>mp
>>> l.java:181)
>>> at 
>>> 
>>>org.apache.jackrabbit.oak.jcr.query.QueryResultImpl$3.next(QueryResultIm
>>>pl
>>> .java:207)
>>> at 
>>> 
>>>org.apache.jackrabbit.oak.jcr.query.QueryResultImpl$3.next(QueryResultIm
>>>pl
>>> .java:170)
>>> at 
>>> 
>>>org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate$SynchronizedItera
>>>to
>>> r.next(SessionDelegate.java:694)
>>> at 
>>> 
>>>org.apache.jackrabbit.oak.jcr.query.PrefetchIterator.next(PrefetchIterat
>>>or
>>> .java:97)
>>> at 
>>> 
>>>org.apache.jackrabbit.commons.iterator.RangeIteratorAdapter.next(RangeIt
>>>er
>>> atorAdapter.java:152)
>>> at 
>>> 
>>>org.apache.jackrabbit.commons.iterator.RangeIteratorDecorator.next(Range
>>>It
>>> eratorDecorator.java:92)
>>> at 
>>> 
>>>org.apache.jackrabbit.commons.iterator.NodeIteratorAdapter.nextNode(Node
>>>It
>>> eratorAdapter.java:80)
>>> at 
>>> 
>>>biz.netcentric.cq.tools.actool.helper.QueryHelper.getNodes(QueryHelper.j
>>>av
>>> a:128)
>>> at 
>>> 
>>>biz.netcentric.cq.tools.actool.helper.QueryHelper.getRepPolicyNodes(Quer
>>>yH
>>> elper.java:90)
>>> at 
>>> 
>>>biz.netcentric.cq.tools.actool.dumpservice.impl.DumpserviceImpl.getACLDu
>>>mp
>>> Beans(DumpserviceImpl.java:399)
>>> 
>>> Of course there is an oak:index on jcr:primaryType, so I really don't
>>>see
>>> the reason why this could be such a hard query to search for nodes
>>>under
>>> a path that are of type rep:ACL?
>>> (If you want more background, this query is used in the netcentric AC
>>> Tool to make a dump of all the existing rep policy ACL nodes)
>>> 
>>> Greetings,
>>> Roy
>>> 
>>> 
>> 
>



Re: Oak query performance problem

2016-10-18 Thread Thomas Mueller
Hi,

> I really don't see the reason why this could be such a hard query


Who said it's a hard query? :-)

Is the problem performance, or is the problem that you get an exception?


If the problem is performance, then you need an index on the node type
rep:ACL.

If the problem is the exception: In your case the query engine is
configured to fail the queries (the reason is written in the exception
message). You can change the limit using the JMX bean
"QueryEngineSettings". The default is Long.MAX_VALUE (virtually no limit)
by the way.
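
If a JMX console is not at hand, roughly the same can be done
programmatically; a sketch (the ObjectName pattern is an assumption, check
the actual name in a JMX console first; the attribute names correspond to
the QueryEngineSettingsMBean getters):

import java.lang.management.ManagementFactory;
import javax.management.Attribute;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class RaiseQueryReadLimit {

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // Assumed name pattern; verify it against your deployment.
        ObjectName pattern = new ObjectName(
                "org.apache.jackrabbit.oak:type=QueryEngineSettings,*");
        for (ObjectName name : server.queryNames(pattern, null)) {
            server.setAttribute(name, new Attribute("LimitReads", 1000000L));
        }
    }
}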

Regards,
Thomas



On 18/10/16 16:44, "Roy Teeuwen"  wrote:

>Hello all,
>
>I got a problem in oak concerning query performance for the following
>simple queries
>
>SELECT * FROM [rep:ACL] WHERE ISDESCENDANTNODE([/content])
>SELECT * FROM [rep:ACL] WHERE ISDESCENDANTNODE([/var])
>
>I get the following exception:
>
>java.lang.UnsupportedOperationException: The query read or traversed more
>than 15 nodes. To avoid affecting other tasks, processing was stopped.
>   at 
>org.apache.jackrabbit.oak.query.FilterIterators.checkReadLimit(FilterItera
>tors.java:66)
>   at 
>org.apache.jackrabbit.oak.spi.query.Cursors$TraversingCursor.fetchNext(Cur
>sors.java:324)
>   at 
>org.apache.jackrabbit.oak.spi.query.Cursors$TraversingCursor.next(Cursors.
>java:303)
>   at 
>org.apache.jackrabbit.oak.query.ast.SelectorImpl.next(SelectorImpl.java:40
>9)
>   at 
>org.apache.jackrabbit.oak.query.QueryImpl$RowIterator.fetchNext(QueryImpl.
>java:773)
>   at 
>org.apache.jackrabbit.oak.query.QueryImpl$RowIterator.hasNext(QueryImpl.ja
>va:798)
>   at 
>org.apache.jackrabbit.oak.jcr.query.QueryResultImpl$3.fetch(QueryResultImp
>l.java:181)
>   at 
>org.apache.jackrabbit.oak.jcr.query.QueryResultImpl$3.next(QueryResultImpl
>.java:207)
>   at 
>org.apache.jackrabbit.oak.jcr.query.QueryResultImpl$3.next(QueryResultImpl
>.java:170)
>   at 
>org.apache.jackrabbit.oak.jcr.delegate.SessionDelegate$SynchronizedIterato
>r.next(SessionDelegate.java:694)
>   at 
>org.apache.jackrabbit.oak.jcr.query.PrefetchIterator.next(PrefetchIterator
>.java:97)
>   at 
>org.apache.jackrabbit.commons.iterator.RangeIteratorAdapter.next(RangeIter
>atorAdapter.java:152)
>   at 
>org.apache.jackrabbit.commons.iterator.RangeIteratorDecorator.next(RangeIt
>eratorDecorator.java:92)
>   at 
>org.apache.jackrabbit.commons.iterator.NodeIteratorAdapter.nextNode(NodeIt
>eratorAdapter.java:80)
>   at 
>biz.netcentric.cq.tools.actool.helper.QueryHelper.getNodes(QueryHelper.jav
>a:128)
>   at 
>biz.netcentric.cq.tools.actool.helper.QueryHelper.getRepPolicyNodes(QueryH
>elper.java:90)
>   at 
>biz.netcentric.cq.tools.actool.dumpservice.impl.DumpserviceImpl.getACLDump
>Beans(DumpserviceImpl.java:399)
>
>Of course there is an oak:index on jcr:primaryType, so I really don't see
>the reason why this could be such a hard query to search for nodes under
>a path that are of type rep:ACL?
>(If you want more background, this query is used in the netcentric AC
>Tool to make a dump of all the existing rep policy ACL nodes)
>
>Greetings,
>Roy
>
>



Re: Default setup configured to index all nodetype

2016-10-18 Thread Thomas Mueller
Hi,

This is an old problem, but never solved. See OAK-1150.

Regards,
Thomas


On 17/10/16 16:08, "Chetan Mehrotra"  wrote:

>Hi Team,
>
>While doing some benchmarks I realized that default setup is
>configured to index *all* nodetypes. In InitialContent the nodetype
>index is configured like
>
>NodeBuilder nodetype = IndexUtils.createIndexDefinition(index,
>"nodetype", true, false,
>ImmutableList.of(JCR_PRIMARYTYPE, JCR_MIXINTYPES),
>null /*declaringNodeTypeNames*/);
>
>As last param declaringNodeTypeNames is null all nodetypes gets indexed
>
>Is that intentional for default setup? I see its the way since very
>beginning but just wanted to check if we should revisit this
>
>Chetan Mehrotra



Re: Possibility of making nt:resource unreferenceable

2016-10-12 Thread Thomas Mueller
Hi,
>
>Currently I am under the impression that we have no knowledge of what
>*might* break, with varying opinions on the matter. Maybe we should to
>find out what *does* break.

I don't think it's possible to easily find out. Customer code might expect
the current behavior, and might silently break (without exception, but
with wrong behavior).

>
>As a remedy for implementations that rely on the current referenceable
>nature, we could provide tooling that automatically adds the
>"mix:referenceable" mixin to existing nt:resource nodes and recommend
>adapting the code to add the mixin as well.

That might work, but in some cases it might also result in problems (if
the code expects this not to be the case).
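
For reference, a minimal sketch of what such tooling could look like
(illustrative only: the query will traverse if there is no index covering
nt:resource, and the batch size is arbitrary):

import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.query.Query;

public class AddReferenceableMixin {

    public static void run(Session session) throws RepositoryException {
        Query query = session.getWorkspace().getQueryManager()
                .createQuery("SELECT * FROM [nt:resource]", Query.JCR_SQL2);
        NodeIterator it = query.execute().getNodes();
        long pending = 0;
        while (it.hasNext()) {
            Node node = it.nextNode();
            if (!node.isNodeType("mix:referenceable")) {
                node.addMixin("mix:referenceable");
                if (++pending % 1000 == 0) {
                    session.save(); // keep individual commits small
                }
            }
        }
        session.save();
    }
}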

Regards,
Thomas



Re: Possibility of making nt:resource unreferenceable

2016-10-12 Thread Thomas Mueller
Hi,

I agree with Julian, I think making nt:resource unreferenceable would
(hardcoding some "magic" in Oak) would lead to hard-to-find bugs and
problems.

> So whatever solution we pick, there is a risk that existing code fails.

Yes. But I think if we create a new nodetype, at least it would be easier
for users to understand the problem.

Also, the "upgrade path" with a new nodetype is smoother. This can be done
incrementally, even thought it might mean more total work. But making
nt:resource unreferenceable would be a hard break, and I think risk of
bigger problems is higher.

Regards,
Thomas



On 07/10/16 12:05, "Julian Reschke"  wrote:

>On 2016-10-07 10:56, Carsten Ziegeler wrote:
>> Julian Reschke wrote
>>> On 2016-10-07 08:04, Carsten Ziegeler wrote:
 ...
 The easiest solution that comes to my mind is:

 Whenever a nt:resource child node of a nt:file node is created, it is
 silently changed to oak:resource.

 Carsten
 ...
>>>
>>> Observation: that might break code that actually wants a referenceable
>>> node: it would create the node, check for the presence of
>>> mix:referenceable, and then decide not to add it because it's already
>>> there.
>>>
>>
>> Well, there might be code that assumes that a file uploaded through
>> webdav is using a resource child node that is referenceable.
>> Or a file posted through the Sling POST servlet has this. Now, you could
>> argue if that code did not create the file, it should check node types,
>> but how likely is that if the code has history?
>>
>> So whatever solution we pick, there is a risk that existing code fails.
>> ...
>
>That is true..
>
>However, my preference would be to only break code which is
>non-conforming right now. Code should not rely on nt:resource being
>referenceable (see
>ml#3.7.11.5%20nt:resource>).
>
>So my preference would be to make that change and see what breaks (and
>get that fixed).
>
> > ...
>
>
>Best regards, Julian



Re: [VOTE] Release Apache Jackrabbit Oak 1.2.20

2016-10-12 Thread Thomas Mueller
[X] +1 Release this package as Apache Jackrabbit Oak 1.2.20




Re: XPath query

2016-10-11 Thread Thomas Mueller
Hi,

Sorry typo in "type", wanted to write "typo":

>I thought even in Jackrabbit 2.x, the "test" was assumed to be a type and
>automatically converted to "@test"...

Should read:

I thought even in Jackrabbit 2.x, the "test" was assumed to be a typo ...


Regards,
Thomas



Re: [VOTE] Release Apache Jackrabbit Oak 1.5.12

2016-10-11 Thread Thomas Mueller
[X] +1 Release this package as Apache Jackrabbit Oak 1.5.12




Re: XPath query

2016-10-11 Thread Thomas Mueller
Hi,

I thought even in Jackrabbit 2.x, the "test" was assumed to be a type and
automatically converted to "@test"... Maybe I'm wrong.

What should work (for both Jackrabbit 2.x and Oak) is using
"test/@jcr:primaryType" instead of "test". So:

/jcr:root//*[test/@jcr:primaryType]
/jcr:root/content/site//element(*,nt:unstructured)
[@jcr:createdBy='admin' and test/@jcr:primaryType]


Regards,
Thomas

On 07/10/16 17:42, "Roy Teeuwen"  wrote:

>Hey all,
>
>Seeing as I don't seem to find a oak-users to subscribe to, I'm going to
>post the question here:
>
>When doing the following XPath query in JCR 2, it would select me all the
>nodes that has a subnode named test. But since oak, this query does not
>work anymore. Is there a reason this stopped working or a way to make it
>work again
>
>Some query example:
>/jcr:root//*[test] or
>/jcr:root/content/site//element(*,nt:unstructured)[@jcr:createdBy='admin'
>and test]
>
>Greetings,
>Roy



Re: Faster reference binary handling

2016-09-16 Thread Thomas Mueller
Hi,

Possibly the binary is downloaded from S3 in this case. We have seen
similar performance issues with datastore GC when using the S3 datastore.

It should be possible to verify this with full thread dumps. Plus we would
see where exactly the download occurs. Maybe it is checking the length or
so.

> this API requires Oak to always retrieve the binary value from the DS

I think the problem is in the S3 datastore implementation, and not the
API. But lets see.

Regards,
Thomas


On 15/09/16 18:04, "Tommaso Teofili"  wrote:

>Hi all,
>
>while working with Oak S3 DS I have witnessed slowness (no numbers, just
>'slow' from a user perspective) in persisting a binary using its
>reference;
>although this may be related to some environment specific issue I wondered
>about the reference binary handling we introduced in JCR-3534 [1].
>In fact the implementation there requires to do something like
>
>ReferenceBinary ref = new SimpleReferenceBinary(referenceString);
>Binary referencedBinary =
>session.getValueFactory().createValue(ref).getBinary();
>node.setProperty("foo", referencedBinary);
>
>on the "installation" side.
>Despite all possible issues in the implementation it seems this API
>requires Oak to always retrieve the binary value from the DS and then
>store
>its value into the node whereas it'd be much better to avoid having to
>read
>the value but instead bind it to that referenced binary.
>
>ReferenceBinary ref = new SimpleReferenceBinary(referenceString);
>if (ref.isValid()) { // referenced binary exists in the DS
>  node.setProperty("foo", ref, Type.BINARY); // set a string with binary
>type !?
>}
>
>I am not sure if the above code could make sense, probably not, but at
>least wanted to point out the problem as to seek for possible
>enhancements.
>
>Regards,
>Tommaso
>
>[1] : https://issues.apache.org/jira/browse/JCR-3534



Re: [suggestion] introduce oak compatibility levels

2016-07-28 Thread Thomas Mueller
Hi,

>I agree if conflicts conceptually with MVCC. However: is there an actual
>problem with the auto-refresh behaviour?

Yes. For example with queries. If changes are made while iterating over
the result of a query, the current behavior is problematic. Example code
(simplified):

RowIterator it = xxx.createQuery(...).execute().getRows();
while (it.hasNext()) {
    otherSession.getNode(...).remove();
    otherSession.save();
    Row row = it.nextRow();
    Node node = row.getNode();
    // node can be null here!
}


So basically the query result contains entries that get removed (by
another session) while iterating over the result. So this can lead to
NullPointerException and other strange behavior (you could get nodes that
no _longer_ match the query constraints), depending on what you do
exactly. Arguably it would be better if the session is isolated from
changes done in another session in the same thread. By the way if using
the same session to remove nodes and iterate over the result, the query
result has to reflect the changes done by the session (I think this is
required by the JCR spec).

Regards,
Thomas



Re: Child count function in the the query language

2016-07-21 Thread Thomas Mueller
Hi,

I'm sorry this feature is not available. You would need to set a property
"childCount" explicitly.

Could you explain the use case please?

Regards,
Thomas



On 20/07/16 15:48, "Milan Milanov"  wrote:

>Hello there,
>
>I'm trying to order some nodes by how many child nodes they have and i've
>stumped upon a headache.
>My query looks like this:
>
>"SELECT * FROM 'nt:hierarchyNode' AS node
>WHERE ISCHILDNODE(node, '/some/given/path')
>ORDER BY 'jcr:primaryType' DESC, LENGTH(node) ASC"
>
>This is taking all child nodes of a given node and ordering the
>nt:folders first (before the nt:files). However, the
>LENGTH(node) does not seem to be doing what I want it to do (count the
>number of children). Is there anything
>oak-specific that could solve my problem, like Modeshape's CHILDCOUNT()
>or like having a custom function
>somehow?
>
>Kind regards,
>Milan Milanov



Re: Usecases around Binary handling in Oak

2016-06-07 Thread Thomas Mueller
Hi,

> I still don't believe that Oak is the right place to implement these
>solutions.

What would be the right place then? The Oak user can store the path of the
file as a string, but he would lose some features (garbage collection for
example).

>Every use case you outlined requires Oak to expose the location of the
>binary objects in the underlying storage.

I don't think every one. I think only UC1 and UC8 need it, read-only. If
not already on the file system, we could copy the file to the file system
(for example to the temp directory).

UC2: already supported using references


UC3: could be implemented with "fast random access reads" and changes in
Tika.


UC4: could we add a method "writeTo(WritableByteChannel target)"?
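
Purely as an API sketch (such a method does not exist today, this only
illustrates the idea):

import java.io.IOException;
import java.nio.channels.WritableByteChannel;
import javax.jcr.Binary;

// Hypothetical extension: a Binary that can copy itself directly to a
// channel, so the caller does not have to pull the bytes through an
// extra InputStream copy.
public interface StreamableBinary extends Binary {

    void writeTo(WritableByteChannel target) throws IOException;
}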

UC5: The SHA-1 hash could be exposed if available, I don't see why not.
Plus maybe UC1 or UC4.


UC6: sounds like UC5

UC7: we would need details (how many writes, do we need a new identifier
for each write operation,...). Can be implemented quite efficiently for
the BlobStore implementations (MongoBlobStore / RDBBlobStore /
FileBlobStore).


>As soon as a file path, a file
>descriptor or an S3 object ID traverses the boundary between Oak and its
>clients, all bets are off.

Well, we would need to define the exact contract, and maybe access rights.

> is the correctness of Oak depending on the behaviour of the user?

To some extent, this is already the case.

Regards,
Thomas



Re: [VOTE] Please vote for the final name of oak-segment-next

2016-04-26 Thread Thomas Mueller
Hi,

I would keep the "oak-segment-*" name, so that it's clear what it is based
on. So:

-1 oak-local-store
-1 oak-embedded-store

+1 oak-segment-*

Within the oak-segment-* options, I don't have a preference.

Regards,
Thomas


On 25/04/16 16:46, "Michael Dürig"  wrote:

>
>Hi,
>
>There is a couple of names that came up in the discussion [1]:
>
>oak-local-store
>oak-segment-file
>oak-embedded-store
>oak-segment-store
>oak-segment-tar
>oak-segment-next
>
>Please vote which of the above six options you would like to see as the
>final name for oak-segment-next [2]:
>
>Put +1 next to those names that you favour, put -1 to veto names and
>remove the remaining names. Please justify any veto as otherwise it is
>non binding.
>
>The name with the most +1 votes and without any -1 vote will be chosen.
>
>The vote is open for the next 72 hours.
>
>Michael
>
>
>[1] http://markmail.org/thread/ktk7szjxtucpqd2o
>[2] https://issues.apache.org/jira/browse/OAK-4245



Re: [VOTE] Release Apache Jackrabbit Oak 1.4.0 (take 3)

2016-03-08 Thread Thomas Mueller
+1 Release this package as Apache Jackrabbit Oak 1.4.0




Re: cache and index backup and restore ?

2016-01-21 Thread Thomas Mueller
Hi,

Sure, there is a performance advantage (for both the persistent cache and
the Lucene index cache). But how much exactly depends on the use case.

You forgot the "persistent cache" by the way.

When restoring, you need to ensure that the local cache is not newer than
the remote (MongoDB), and from the same "branch" (when copying and
branching MongoDB databases).

> Also, if I tar up everything to restore multiple times, is there
>anything I
need to edit on disk to make the instances distinct.

No, not that I know of.

Regards,
Thomas



On 21/01/16 14:56, "ianbos...@gmail.com on behalf of Ian Boston"
 wrote:

>Hi,
>Having done a cold backup of a MongoMK instance with a FS Datastore, is
>there any advantage in also backing up the local disk copy of the lucene
>index (normally in repository/index/** ) and persistent cache file
>(repository/cache/**) so that it can be restored on more than one Oak
>instance in the cluster. or do both those subtrees get zapped when the
>new instance starts ?
>
>Also, if I tar up everything to restore multiple times, is there anything
>I
>need to edit on disk to make the instances distinct. IIRC there was a
>sling.id at one point, but that might have been JR2 rather than Oak.
>
>Best Regards
>Ian



Re: Restructure docs

2016-01-20 Thread Thomas Mueller
Hi,

I'm not in favour of this, as it breaks links, and I don't see a clear
improvement. I'm more in favour of incremental, small changes.

> an easier way to add/update documentation about oak specific features.

Sorry I don't understand, what is the problem with the current approach?
Is the menu too large? Sure, we could add some structure, but I wouldn't
just add "features".

>
>Currently we have a section in the left-hand menu: Features and plugins.

I think "plugins" is an arbitrary name, it might as well just be
"features".


>This mean that if we add a feature over there we have to update the
>whole doc site as the menu is eventually static in all pages.

I'm sorry I don't understand. The left-hand menu is in site.xml (in one
place). I don't think we will add all that many "features" to Oak in the
future.

Regards,
Thomas



Re: Restructure docs

2016-01-20 Thread Thomas Mueller
Hi,

I also always deploy the whole site with maven.

Regards,
Thomas

On 20/01/16 10:16, "Davide Giannella" <dav...@apache.org> wrote:

>On 20/01/2016 08:14, Thomas Mueller wrote:
>> ...
>>> This mean that if we add a feature over there we have to update the
>>> whole doc site as the menu is eventually static in all pages.
>> I'm sorry I don't understand. The left-hand menu is in site.xml (in one
>> place). I don't think we will add all that many "features" to Oak in the
>> future.
>>
>
>When you change/add/remove an item from the left-hand menu, you'll have
>to redeploy the whole site as it will be hardcoded within the html of
>each page. Deploying the whole website is a long process. Therefore
>limiting the changes over there make things faster.
>
>With regards to the fact that we won't add that many features in the
>section I hope it won't be true. I see it as a section where we document
>how to use or how it works a particular feature. So I'm really hoping
>we'll add more and more.
>
>Davide
>
>



Re: oak-run 50MB

2015-12-10 Thread Thomas Mueller
Hi,

Could we get rid of unused stuff? Like Hadoop (7 MB!). Do we need Solr
(2.3 MB), Tika, Zookeeper, Jetty, H2 (the SQL part)? Do we need the
Jackrabbit remoting stuff? I guess we need Groovy (4 MB) and Lucene (4 MB).

Of those 50MB, just 8% is Oak, and the rest is dependencies.


Regards,
Thomas






On 10/12/15 11:38, "Davide Giannella"  wrote:

>On 22/07/2015 10:23, Davide Giannella wrote:
>> On 20/07/2015 14:12, Julian Sedding wrote:
>>> +1 It sounds sensible to split this up. It seems that it has evolved
>>> into a collection of functionality that shares mostly the fact that
>>> they are run on the command line. I would like to see the logic used
>>> to bootstrap various Oak setups based on command line parameters to be
>>> extracted and re-used.
>> I've created https://issues.apache.org/jira/browse/OAK-3134 to keep
>> track of an initial investigation.
>>
>> Can I ask everyone to jump on it adding their own knowledge and
>>suggestions?
>>
>
>As we're speaking of modularisation I'm reviving this thread.  Please
>see the ticket https://issues.apache.org/jira/browse/OAK-3134 for details.
>
>I will file separate tickets for the actual actions, around moving each
>individual functionality to own bundle if no one will object.
>
>Cheers
>Davide
>
>



Re: Semantic version in Oak

2015-12-08 Thread Thomas Mueller
Hi,

I think the main difference between Oak and Sling is, AFAIK, that Sling is
"forward only", and does not maintain branches, and does not backport
things.

In Oak, we add new features in trunk (changing the API), and backport some
of those features, and not necessarily all of them, and not necessarily in
the same order as they were implemented:

== Trunk ==

add feature A => bump export version to 1.1
... later on ...
add feature B => bump export version to 2.0
... later on ...
add feature C => bump export version to 2.1


== Branch ==

backport feature C => bump export version to ?
... later on ...
backport feature A => bump export version to ?


Regards,
Thomas



On 08/12/15 09:41, "Michael Dürig"  wrote:

>
>>> Packages evolve independently, but they do in potentially
>>> divergent branches. This is the kind of timeline that we usually
>>> face:
>>>
>>> - Oak 1.4 has a package org.foo.bar 1.0 - Some changes happen on
>>> the development branch 1.5 - Oak 1.5 now has a package org.foo.bar
>>> 1.1 - A change X happen in the development branch 1.5 - Oak 1.5 now
>>> has a package org.foo.bar 1.2 - The change X has to be backported
>>> to the maintenance branch 1.4 - Oak 1.4 now should have a package
>>> org.foo.bar 1.1
>>>
>>> Assuming that the versions were incremented following the semantic
>>> versioning rules, we now have two packages - both called
>>> org.foo.bar and both having version 1.1 - that live on two
>>> different branches and contain different code.
>>>
>>> The only obvious solution that comes to my mind is to bump the
>>> major version of every package right after the development branch
>>> 1.5 is started, but I don't like this approach very much because it
>>> would break compatibility with existing clients for no obvious
>>> reason.
>>
>> This scenario is the exact problem you are facing while branching and
>> evolving the branches in parallel to trunk.
>>
>> The only end-developer friendly solution is to bite the bullet and do
>> it really properly and make sure you evolve exported packages (being
>> your API) in a truly diligent matter: Consider a package name and its
>> export version as the package¹s identity and always make sure this
>> identity (label) refers to the identical exported API.
>>
>
>I fail to see how this would work with branches. For Francesco's example
>this would mean that we'd need to backport everything into the branch
>effectively aligning it with trunk and thus obviating its purpose.
>
>Michael
>
>
>



Re: jackrabbit-oak build #6972: Broken

2015-12-01 Thread Thomas Mueller
"Out of heap space"

On 27/11/15 18:54, "Travis CI"  wrote:

>Build Update for apache/jackrabbit-oak
>-
>
>Build: #6972
>Status: Broken
>
>Duration: 420 seconds
>Commit: 3f083e0134aca930ed44bdb5a19ccff9794aef1f (trunk)
>Author: Julian Reschke
>Message: OAK-2655: Test failure: OrderableNodesTest.testAddNode -
>reenabling test
>
>git-svn-id: https://svn.apache.org/repos/asf/jackrabbit/oak/trunk@1716901
>13f79535-47bb-0310-9956-ffa450edef68
>
>View the changeset:
>https://github.com/apache/jackrabbit-oak/compare/9ce2d9e4f7c4...3f083e0134
>ac
>
>View the full build log and details:
>https://travis-ci.org/apache/jackrabbit-oak/builds/93563971
>
>--
>sent by Jukka's Travis notification gateway



Re: Test failures in o.a.j.o.plugins.document.persistentCache.BroadcastTest

2015-11-24 Thread Thomas Mueller
Hi,

I will disable those tests. Even though the unit tests always worked for
me, I couldn't get UDP to work with two "real" repositories (two
processes).

Regards,
Thomas


On 24/11/15 13:33, "Francesco Mari"  wrote:

>broadcastUDP



Re: Start multiple instances of Oak throws an CommitFailedException

2015-10-27 Thread Thomas Mueller
Hi,

I would say initializing the collections (or tables when using a
relational database) is not expected to be done concurrently. Maybe we can
somehow prevent that in Oak (patches are welcome!), or we document that
this is not supported.
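
Until then, a possible client-side workaround is to retry the repository
start-up when the initial merge fails; a sketch (not an Oak API, and the
retry count and delay are arbitrary):

import java.util.concurrent.Callable;

public class RetryingStartup {

    public static <T> T withRetry(Callable<T> startRepository) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= 5; attempt++) {
            try {
                return startRepository.call();
            } catch (Exception e) { // e.g. a wrapped CommitFailedException
                last = e;
                Thread.sleep(1000L * attempt); // back off before the next attempt
            }
        }
        throw last;
    }
}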

Regards,
Thomas



On 26/10/15 18:36, "Navarro, Gabriela Matias" 
wrote:

>Hi everyone,
>
>I'm working on a project where we need to deploy multiple instances of
>Oak pointing to the same MongoDB. The problem is that sometimes in the
>initialization phase (initialize method of OakInitializer class) a
>CommitFailedException happens and what  I understood of why this happens
>it's because if two or more instances tries to initialize the collection
>on MongoDB at the same time, it will get a conflict on the merge.
>
>This initialization problem affects the initialization of others services
>from a OSGI environment.
>
>Does anyone know how to fix this problem ?
>
>Thanks,
>
>Gabriela Navarro



Re: Reindexing problems

2015-10-21 Thread Thomas Mueller
OK, I think we (kind of) agree on how to ensure important indexes are
available.

>>Additionally, for "synchronous" indexes (property index and so on), I
>>would like to always create and reindex them asynchronously by default,

OK, I see that large branches are a problem.

Instead of using branches, what about:

* First switch the index to "building in progress" so that _queries_ don't
use it. 

* Build the index in multiple commits:
  - Traverse the repository, and
   - as soon as you have 1000 index changes in memory, commit them.
* Then continue to traverse, in a new transaction.
* Until the repository is fully traversed.
* Concurrent changes would update the index as normal.
* At the end of the "index creation traversal", switch the index to "ready" (see the sketch below)
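
In pseudo-code (the methods are hypothetical stand-ins for the real index
update machinery):

public class BatchedIndexBuild {

    static final int BATCH_SIZE = 1000;

    void build(Iterable<String> allPaths) {
        switchIndexState("building");   // queries ignore the index for now
        int pending = 0;
        for (String path : allPaths) {  // traverse the repository
            addToIndex(path);
            if (++pending >= BATCH_SIZE) {
                commitBatch();          // many small commits instead of one huge one
                pending = 0;
            }
        }
        commitBatch();
        switchIndexState("ready");      // queries may use the index again
    }

    // Hypothetical helpers:
    void switchIndexState(String state) { }
    void addToIndex(String path) { }
    void commitBatch() { }
}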

Regards,
Thomas



Reindexing problems

2015-10-21 Thread Thomas Mueller
Hi,

If an index provider is (temporarily) not available, the 
MissingIndexProviderStrategy resets the index so it is re-indexed. This is a 
problem (OAK-2024, OAK-2203, OAK-2429, OAK-3325, OAK-3366, OAK-3505, OAK-3512, 
OAK-3513), because re-indexing is slow and happens in one transaction. It can also cause 
many threads to concurrently build the index. Currently, synchronous indexes 
are built in one "transaction", which is anyway a performance problem (for new 
indexes and reindexing). If an index is not available when running a query, 
traversal is used, which is also a problem.

What about:

* (a) Hardcode (not rely on the Whiteboard or OSGi) the known indexes for 
property, reference, nodeType, lucene, counter index. This is for both writing 
(IndexEditor) and reading (QueryIndex) . That way, those indexes are always 
available, and we never get into a situation where they are temporarily not 
available.

* (b) Where we can't use hardcoding, use hard service references (Whiteboard / 
OSGi).

* (c) If we can't do that, block or fail commits if one of the configured 
indexes is not available, for example for the Solr index (if such an index is 
configured).

Additionally, for "synchronous" indexes (property index and so on), I would 
like to always create and reindex them asynchronously by default, and only once 
they are available switch to synchronous mode. I think (but I'm not sure) this 
is OAK-1456.

What do you think?

Regards,
Thomas



Re: jackrabbit-oak build #6598: Broken

2015-10-09 Thread Thomas Mueller
Hi,

> some fix missing in the 1.2 branch?

You are right, it looks like the cause is OAK-3432. This is fixed in the
trunk but not in the branch, because I thought it doesn't affect the
branch (because the branch doesn't contain OAK-3234). But, now I see it
also affects the branch.


Regards,
Thomas

On 08/10/15 09:27, "Marcel Reutegger"  wrote:

>the failure is:
>
>Failed tests:   
>testLoaderBlock(org.apache.jackrabbit.oak.cache.ConcurrentTest): Had to
>wait unexpectedly long for other threads: 1207
>
>
>looks unrelated to Chetan's change.
>
>is the threshold too low for the test or some fix missing in the 1.2
>branch?
>
>Regards
> Marcel
>
>On 08/10/15 09:00, "Travis CI" wrote:
>
>>Build Update for apache/jackrabbit-oak
>>-
>>
>>Build: #6598
>>Status: Broken
>>
>>Duration: 703 seconds
>>Commit: 8e810be6066862f415f0067248a29a945e0d8d13 (1.2)
>>Author: Chetan Mehrotra
>>Message: OAK-3476 - Memory leak caused by using marker names based on non
>>static session id
>>
>>Merging 1707435
>>
>>
>>git-svn-id: 
>>https://svn.apache.org/repos/asf/jackrabbit/oak/branches/1.2@1707437
>>13f79535-47bb-0310-9956-ffa450edef68
>>
>>View the changeset:
>>https://github.com/apache/jackrabbit-oak/compare/e4567a6c224a...8e810be60
>>6
>>68
>>
>>View the full build log and details:
>>https://travis-ci.org/apache/jackrabbit-oak/builds/84246458
>>
>>--
>>sent by Jukka's Travis notification gateway
>



Re: [Oak] Lucene copyonread OOM

2015-10-09 Thread Thomas Mueller
Hi,

Is this a 32-bit or 64-bit JVM?

Could you try

ulimit -v unlimited

See 
http://stackoverflow.com/questions/8892143/error-when-opening-a-lucene-index-map-failed
and possibly 
http://stackoverflow.com/questions/11683850/how-much-memory-could-vm-use-in-linux

Regards,
Thomas


From: Geoffroy Schneck
Date: Friday 9 October 2015 16:01
To: "oak-dev@jackrabbit.apache.org", DL-tech
Subject: [Oak] Lucene copyonread OOM

Hello Oak Experts,

On an Oak 1.2.4 version, OOMs are thrown quite regularly by the copyonread 
feature, see below.

However, the system where it runs has 32GB total, and the JVM -Xmx setting is 
set to 12G. The JVM memory settings are the following:

-Xms12288m -Xmx12288m -XX:MaxMetaspaceSize=512m -XX:MaxPermSize=512M 
-XX:ReservedCodeCacheSize=96m

We have to assume, the repository size is huge (but unknown to me at that 
moment).


- Where does the Lucene copyonread feature use the memory from? Off-heap 
memory or JVM allocated memory?

- Are there additional memory settings to increase for this specific 
feature? Or does one of the above seem insufficient?

Thanks,

09.10.2015 09:52:42.439 *ERROR* [pool-5-thread-28] 
org.apache.jackrabbit.oak.plugins.index.lucene.IndexTracker Failed to open 
Lucene index at /oak:index/lucene
java.io.IOException: Map failed
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:907)
at org.apache.lucene.store.MMapDirectory.map(MMapDirectory.java:283)
at 
org.apache.lucene.store.MMapDirectory$MMapIndexInput.(MMapDirectory.java:228)
at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:195)
at 
org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier$CopyOnReadDirectory$FileReference.openLocalInput(IndexCopier.java:382)
at 
org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier$CopyOnReadDirectory.openInput(IndexCopier.java:227)
at 
org.apache.lucene.codecs.lucene40.Lucene40StoredFieldsReader.(Lucene40StoredFieldsReader.java:82)
at 
org.apache.lucene.codecs.lucene40.Lucene40StoredFieldsFormat.fieldsReader(Lucene40StoredFieldsFormat.java:91)
at 
org.apache.lucene.index.SegmentCoreReaders.(SegmentCoreReaders.java:129)
at org.apache.lucene.index.SegmentReader.(SegmentReader.java:96)
at 
org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:62)
at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:843)
at 
org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:66)
at 
org.apache.jackrabbit.oak.plugins.index.lucene.IndexNode.(IndexNode.java:94)
at 
org.apache.jackrabbit.oak.plugins.index.lucene.IndexNode.open(IndexNode.java:62)
at 
org.apache.jackrabbit.oak.plugins.index.lucene.IndexTracker$1.leave(IndexTracker.java:98)
at 
org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeChanged(EditorDiff.java:153)
at 
org.apache.jackrabbit.oak.plugins.segment.MapRecord.compare(MapRecord.java:487)
at 
org.apache.jackrabbit.oak.plugins.segment.MapRecord.compareBranch(MapRecord.java:565)
at 
org.apache.jackrabbit.oak.plugins.segment.MapRecord.compare(MapRecord.java:470)
at 
org.apache.jackrabbit.oak.plugins.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:583)
at 
org.apache.jackrabbit.oak.spi.commit.EditorDiff.childNodeChanged(EditorDiff.java:148)
at 
org.apache.jackrabbit.oak.plugins.segment.MapRecord$3.childNodeChanged(MapRecord.java:444)
at 
org.apache.jackrabbit.oak.plugins.segment.MapRecord.compare(MapRecord.java:487)
at 
org.apache.jackrabbit.oak.plugins.segment.MapRecord.compare(MapRecord.java:436)
at 
org.apache.jackrabbit.oak.plugins.segment.SegmentNodeState.compareAgainstBaseState(SegmentNodeState.java:583)
at org.apache.jackrabbit.oak.spi.commit.EditorDiff.process(EditorDiff.java:52)
at 
org.apache.jackrabbit.oak.plugins.index.lucene.IndexTracker.update(IndexTracker.java:108)
at 
org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexProvider.contentChanged(LuceneIndexProvider.java:69)
at 
org.apache.jackrabbit.oak.spi.commit.BackgroundObserver$1$1.call(BackgroundObserver.java:125)
at 
org.apache.jackrabbit.oak.spi.commit.BackgroundObserver$1$1.call(BackgroundObserver.java:119)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Map failed
at sun.nio.ch.FileChannelImpl.map0(Native Method)
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:904)
... 35 common frames omitted
09.10.2015 09:52:42.439 *WARN* [pool-5-thread-70] 
org.apache.jackrabbit.oak.plugins.index.lucene.IndexCopier Error occurred 

Re: Jackrabbit OAK property index never used ?

2015-09-24 Thread Thomas Mueller
Hi,

OAK-2852 is fixed now, this will be in the next release.


If you don't want to wait for that, I suggest to reindex the counter
index, that is, set "reindex" on that index to "true" (and save).
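
For example (assuming the default index location /oak:index/counter):

import javax.jcr.RepositoryException;
import javax.jcr.Session;

public class ReindexCounterIndex {

    public static void reindex(Session session) throws RepositoryException {
        session.getNode("/oak:index/counter").setProperty("reindex", true);
        session.save();
    }
}

The counter index is asynchronous, so the rebuild happens on the next async
indexing cycle.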

Regards,
Thomas

On 23/09/15 16:55, "Sebastien Berthezene" <sberthez@gmail.com> wrote:

>Thanks for your help but I have the node "/oak:index/counter". Dump of
>node
>get with "session.getNodeByIdentifier("/oak:index/counter")" is the
>following :
>type = counter
>async = async
>jcr:primaryType = oak:QueryIndexDefinition
>
>From what i understand it seems that node count is not set (is it supposed
>to be set somewhere ?) and so traversal index use a default hardcoded
>value
>corresponding to ApproximateCounter.COUNT_RESOLUTION * 20 ( = 2000) always
>lower than node count managed by my property index. How could i fix this
>node count supposed to be stored into a ":count" property ?
>
>Regards
>
>2015-09-23 12:15 GMT+02:00 Thomas Mueller <muel...@adobe.com>:
>
>> Hi,
>>
>> I think you are hitting OAK-2852.
>>
>> Regards,
>> Thomas
>>
>>
>>
>> On 23/09/15 11:42, "Thomas Mueller" <muel...@adobe.com> wrote:
>>
>> >Hi,
>> >
>> >Do you have a node called /oak:index/counter ? Out of the-box, it
>>should
>> >be there (with a recent version of Oak). That is the approximate
>>counter
>> >index that is used to estimate how many nodes to traverse. As a
>> >workaround, you probably have to re-index that one manually. I wonder
>>why
>> >that index is not updated in your case, it is a regular asynchronous
>> >index.
>> >
>> >Regards,
>> >Thomas
>> >
>> >
>> >On 22/09/15 17:04, "Sebastien Berthezene" <sberthez@gmail.com>
>>wrote:
>> >
>> >>Because 2000 (not corresponding to reality) is lower than 12702 from
>>my
>> >>property index, the traversal mode is used.
>> >
>>
>>



Re: svn commit: r1704844 - in /jackrabbit/oak/trunk: oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/index/counter/jmx/ oak-core/src/main/java/org/apache/jackrabbit/oak/query/ oak-core/src/ma

2015-09-24 Thread Thomas Mueller
Hi,

Yes, makes sense... what about getIndexCostInfo?

Regards,
Thomas


On 24/09/15 07:41, "Chetan Mehrotra"  wrote:

>Hi Thomas,
>
>On Wed, Sep 23, 2015 at 6:51 PM,   wrote:
>>  /**
>> + * Get the index cost. The query must already be prepared.
>> + *
>> + * @return the index cost
>> + */
>> +String getIndexCost();
>
>Should this be returning string? May be we should name it better
>
>Chetan Mehrotra



Re: Jackrabbit OAK property index never used ?

2015-09-23 Thread Thomas Mueller
Hi,

I think you are hitting OAK-2852.

Regards,
Thomas



On 23/09/15 11:42, "Thomas Mueller" <muel...@adobe.com> wrote:

>Hi,
>
>Do you have a node called /oak:index/counter ? Out of the-box, it should
>be there (with a recent version of Oak). That is the approximate counter
>index that is used to estimate how many nodes to traverse. As a
>workaround, you probably have to re-index that one manually. I wonder why
>that index is not updated in your case, it is a regular asynchronous
>index.
>
>Regards,
>Thomas
>
>
>On 22/09/15 17:04, "Sebastien Berthezene" <sberthez@gmail.com> wrote:
>
>>Because 2000 (not corresponding to reality) is lower than 12702 from my
>>property index, the traversal mode is used.
>



Re: Jackrabbit OAK property index never used ?

2015-09-23 Thread Thomas Mueller
Hi,

Do you have a node called /oak:index/counter ? Out of the-box, it should
be there (with a recent version of Oak). That is the approximate counter
index that is used to estimate how many nodes to traverse. As a
workaround, you probably have to re-index that one manually. I wonder why
that index is not updated in your case, it is a regular asynchronous index.

Regards,
Thomas


On 22/09/15 17:04, "Sebastien Berthezene"  wrote:

>Because 2000 (not corresponding to reality) is lower than 12702 from my
>property index, the traversal mode is used.



Re: Question about the Oak Query Index

2015-09-22 Thread Thomas Mueller
Hi,

I will change the documentation to "Oak does not index _as_much_ content
by default as does Jackrabbit 2".

Regards,
Thomas



On 21/09/15 10:11, "Michael Lemler"  wrote:

>Oak 
>does not index content by default as does Jackrabbit 2



Re: Jackrabbit OAK property index never used ?

2015-09-22 Thread Thomas Mueller
Hi,

Which version of Oak do you use?

Could you get the estimated node count for the root node, and for this
index? To get that, for example use the NodeCounter JMX bean
(NodeCounterMBean), getEstimatedChildNodeCounts("/", 2) and
getEstimatedChildNodeCounts("/oak:index", 3).

Regards,
Thomas


On 22/09/15 12:51, "Sebastien Berthezene"  wrote:

>I am trying to use property index with Jackrabbit but when i have many
>thousands of nodes it seems that transversal mode is always chosen for
>query execution.
>
>For example, i have 10 000 nodes of type test:mytype under a single node
>/mystore. Into each of these nodes i have a property test:myprop with 3
>different possible values (nearly 3000 nodes for each value).
>
>When i run the following query
>
>select [jcr:uuid] from [test:mytype] where [test:myprop]='MyValue'
>
>the query engine processor always use the transversing mode and do not use
>the index i have created for test:myprop.
>
>I tried to debug the code directly, i clearly see the query engine trying
>to use the property index i have defined but do not use it because cost of
>transversing cursor index seems to be always "100" and my property index
>contains nearly "3000" nodes for each possible value. Query engine
>consider
>that using transversal cursor will be more efficient, even if engine will
>need to transverse 1 nodes !
>
>Did someone already faced similar problem ?
>
>Regards



Re: Repo Inconsistencies due to OAK-3169

2015-09-01 Thread Thomas Mueller
Hi,

Could someone please update OAK-3169 with a link to the new issue, or the
resolution?

Regards,
Thomas

On 24/08/15 08:46, "Davide Giannella"  wrote:

>On 22/08/2015 19:59, Manfred Baedke wrote:
>> Hi,
>>
>> OAK-3169 caused inconsistencies that currently have to be repaired
>> manually, even after a patch has been applied. Since lots of customers
>> are suffering from this, Andrew Khoury suggested to implement an
>> optional auto-repair feature, which logs a warning and removes and
>> re-adds mixin:versionable when a broken version history is found
>> (losing the version history, of course).
>> One question would be where to put the repair code, because it's
>> unclear to me if there might be multiple locations in the Oak code
>> where an exception might be thrown due to the inconsistency.
>> Any thoughts?
>
>I'm not familiar with the issue but as far as I can read this applies to
>all our persistence. And it's pretty core to be in oak-core. On the
>other hand if rep:versionablePath is only related to JCR and we can deal
>with the fix from a JCR point of view, ie without accessing the
>NodeStore layer, I would say it fits better in oak-jcr.
>
>if instead you're thinking of a tool to be run one-off for fixing the
>situation I would suggest either a groovy script for the oak console or,
>yet, another parameter in oak-run.
>
>When/how do you detect a broken node?
>
>Davide
>
>



Re: [VOTE] Release Apache Jackrabbit Oak 1.3.5

2015-09-01 Thread Thomas Mueller
[X] +1 Release this package as Apache Jackrabbit Oak 1.3.5


On 01/09/15 15:49, "Julian Reschke"  wrote:

>On 2015-09-01 15:30, Davide Giannella wrote:
>> ...
>>  [ ] +1 Release this package as Apache Jackrabbit Oak 1.3.5
>>  [ ] -1 Do not release this package because...
>> ...
>
>  [X] +1 Release this package as Apache Jackrabbit Oak 1.3.5



Using of final for variables and parameters

2015-08-28 Thread Thomas Mueller
Hi,

I wonder what the team thinks about using final for variables and 
parameters. In Oak, so far we didn't use it a lot. This question has come up 
with OAK-3148. The patch uses final for variables, but not for parameters. 
Lately, I have seen some code (I forgot where) that uses final for 
parameters in an _interface_. Please note this is not about using final for 
fields, where I think everybody agrees it should be used. It's just about 
variables and parameters.

I think we have 3 options:

(a) use final for variables and parameters everywhere possible

(b) let the individual developer decide

(c) don't use it except if needed

Some links:
http://stackoverflow.com/questions/154314/when-should-one-use-final-for-method-parameters-and-local-variables
http://stackoverflow.com/questions/137868/using-final-modifier-whenever-applicable-in-java

Personally, my favorite is (c), followed by (b). As for (a), I think (same as 
Alex Miller at StackOverflow) it clutters the code and decreases readability. 
Too bad final is not the default in Java, but Java will not change, and we 
are stuck with Java. I think using final will not improve the code, because 
people don't accidentally change variables and parameters, so it will not 
help the writer of a method, it will not help the compiler or performance (the 
JVM can easily see if a variable or parameter is effectively assigned only 
once). To improve the code, I'm all for using Checkstyle, unit tests, code 
coverage, mutation testing, enforcing to write Javadocs for interfaces, and so 
on. But using final wherever possible, I think it would be a step backwards.

Regards,
Thomas



Re: Using of final for variables and parameters

2015-08-28 Thread Thomas Mueller
Hi,

- I know that the variable "state" created at the beginning of the method
is the same one I can access at the end.

For short methods, it's easy to see this, by reading the code. In my view
easier than using final, which makes the code less readable.

For large methods,... well you should avoid large methods :-) But sure,
for large methods, using final is OK for me, if you have variables on
the outer level. But for example here:

for (final String id : newBlobIds) {
    assertTrue(splitBlobStore.isMigrated(id));
}


it's very easy to see that id is not changed. Or here:

public String getPath() {
    final StringBuilder path = new StringBuilder("/");
    return Joiner.on('/').appendTo(path, nameQueue).toString();
}


Or here:

while (true) {
final int expectedByte = expected.read();
final int actualByte = actual.read();
assertEquals(expectedByte, actualByte);
if (expectedByte == -1) {
break;
}
}


(I think it's some kind of code smell to reassign a parameter anyway).

There is a Checkstyle feature to ensure parameters are not reassigned. I'm
fine using it, even if it means we need to change some code.

That's why I vote for (b).

Even though I prefer (a), I can live with (b), but I would very much
prefer using final only very sparsely.

Regards,
Thomas



SegmentStore: MultiStore Compaction

2015-08-28 Thread Thomas Mueller
Hi,

I thought about SegmentStore compaction and made a few slides:

http://www.slideshare.net/ThomasMueller12/multi-store-compaction

Feedback is welcome! The idea is at quite an early stage, so if you don't 
understand or agree with some items, I'm to blame.

Regards,
Thomas


Re: SegmentStore: MultiStore Compaction

2015-08-28 Thread Thomas Mueller
Hi,

AFAIK your tests re. restart have been done from within an OSGi
container (AEM) restarting the repository bundle.

Actually I restarted the JVM. But I probably didn't use the very latest
version of Oak, and checkpoints may also have played a role in my case (there
were 2 checkpoints, probably from Lucene indexing).

I still have access to that repository (to both the pre-compacted and the
compacted version), it should be quite simple to test various ideas. But
of course each test takes a long time (an hour or so).

Regards,
Thomas



Re: SegmentStore: MultiStore Compaction

2015-08-28 Thread Thomas Mueller
Hi,

I'm not an expert on compaction, but the differences to the current approach
that I see are:

* No compaction maps. No memory problem. No persistent compaction map. As
far as I understand, currently you can have _multiple_ compaction maps at
the same time. I think that persisting the compaction map is problematic
for code complexity and performance reasons, especially with large
repositories. As for performance, it depends a lot on how the compaction
maps are stored (randomized access patterns will hurt performance a lot).

* Simpler architecture: Multi-store compaction is implemented on top of
the current SegmentStore, which makes it more modular and easier to
(unit-) test. The current compaction code, on the other hand, is more
interwoven with the regular SegmentStore code.
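
To make the copy step more concrete, here is a minimal sketch (the class and
method names are made up; it only assumes Oak's generic NodeStore/NodeState
API, and leaves out checkpoints, error handling, and the final switch-over):

import org.apache.jackrabbit.oak.api.CommitFailedException;
import org.apache.jackrabbit.oak.spi.commit.CommitInfo;
import org.apache.jackrabbit.oak.spi.commit.EmptyHook;
import org.apache.jackrabbit.oak.spi.state.ApplyDiff;
import org.apache.jackrabbit.oak.spi.state.NodeBuilder;
import org.apache.jackrabbit.oak.spi.state.NodeState;
import org.apache.jackrabbit.oak.spi.state.NodeStore;
import static org.apache.jackrabbit.oak.plugins.memory.EmptyNodeState.EMPTY_NODE;

public class MultiStoreCompactionSketch {

    // Copy the current head of the old (fragmented) store into a fresh, empty
    // store. Once all references point to the new store, the old store can be
    // deleted as a whole, without a compaction map.
    public static void compactInto(NodeStore oldStore, NodeStore newStore)
            throws CommitFailedException {
        NodeState head = oldStore.getRoot();
        NodeBuilder builder = newStore.getRoot().builder();
        // diffing against the empty state results in a full deep copy into the builder
        head.compareAgainstBaseState(EMPTY_NODE, new ApplyDiff(builder));
        newStore.merge(builder, EmptyHook.INSTANCE, CommitInfo.EMPTY);
    }
}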

 your approach most likely suffers from the same problems we currently
have re. contention, performance, in-memory references, ...

Contention and performance, what are the problems right now?


Memory references: with a restart, old memory references are gone, so the
old segment store can be removed fully, without risk. Right now, at least
with the version of Oak I have tested, 1.2.3 I think, running online
compaction multiple times, each time with a restart, did not shrink the
repository (size is 3 times the size of a fully compacted repo, with very
few writes). Without a restart, access to very old objects can result in
an easy-to-understand exception message.


Regards,
Thomas




On 28/08/15 13:57, Michael Dürig mdue...@apache.org wrote:


AFAIU this is pretty much what we are now doing under the hood. That is,
your proposal would make the compaction step more explicit and visible
above the node store API.
An advantage of your approach is preventing mixed segments altogether
(i.e. compacted segments still referring to uncompacted ones). This is
something we were having problems with in the past, which I however
believe we have solved by now.
However, your approach most likely suffers from the same problems we
currently have re. contention, performance, in-memory references, ...

Michael



On 28.8.15 10:05 , Thomas Mueller wrote:
 Hi,

 I thought about SegmentStore compaction and made a few slides:

 http://www.slideshare.net/ThomasMueller12/multi-store-compaction

 Feedback is welcome! The idea is at quite an early stage, so if you
don't understand or agree with some items, I'm to blame.

 Regards,
 Thomas




Re: Is the hashing of long paths still needed?

2015-08-21 Thread Thomas Mueller
Hi,

The DocumentStore doesn't really know the path, it only knows the key, and
if the key is hashed you can't calculate the path.

There are some options:

(a) Each document that has a hashed path as the key also has a path
property (with the real path). You could use that (cache it, read it if
needed, possibly from all backends). A rough sketch follows below, after (b).

(b) Change the DocumentStore API: add the path in addition to the key.
This is quite some work, and errors could be introduced here (the wrong
path is passed and so on).
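
As a minimal sketch of option (a): the key either contains the path directly
("<depth>:<path>") or, for long paths, only a hash ("<depth>:h<hash>/<name>"),
in which case the path would be read from the document itself. The property
name "_path" and the exact calls below are assumptions on my side, and
caching is left out:

import org.apache.jackrabbit.oak.plugins.document.Collection;
import org.apache.jackrabbit.oak.plugins.document.DocumentStore;
import org.apache.jackrabbit.oak.plugins.document.NodeDocument;

public class PathResolverSketch {

    // Resolve the real path for a document key, falling back to a property
    // stored in the document when the key only contains a hash.
    public static String resolvePath(DocumentStore store, String id) {
        String rest = id.substring(id.indexOf(':') + 1);
        if (!rest.startsWith("h")) {
            // short key: the path is stored directly in the key
            return rest;
        }
        // long key: the path cannot be derived from the key
        NodeDocument doc = store.find(Collection.NODES, id);
        return doc == null ? null : (String) doc.get("_path");
    }
}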

Regards,
Thomas

On 21/08/15 11:57, Bertrand Delacretaz bdelacre...@apache.org wrote:

Hi,

Continuing to play with Robert's MultiplexingDocumentStore [1] I got a
failure in the oak-jcr module's LongPathTest.

That's due to the conversion of long paths to their hashed variants -
those cannot be used to locate the appropriate DocumentStore when
multiplexed, as that decision is based on the real path. In some cases
like UpdateOp the real path is saved alongside the hashed one, but not
always.

Removing the path hashing as shown in the below patch removes that
problem, and all oak-core tests pass with this change.

Is that hashing still needed, or was that created due to backend
limitations which are gone now?

If still needed, could it be made configurable so that users can make
their own tradeoff (MultiplexingDocumentStore vs. supporting backends
that require that), or are there other constraints?

-Bertrand

[1] https://github.com/bdelacretaz/jackrabbit-oak/tree/bertrand-multiplex



diff --git a/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document/util/Utils.java b/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document/util/Utils.java
index 4577c4b..d97ae76 100644
--- a/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document/util/Utils.java
+++ b/oak-core/src/main/java/org/apache/jackrabbit/oak/plugins/document/util/Utils.java
@@ -254,6 +254,7 @@ public class Utils {
 }

 public static String getIdFromPath(String path) {
+/*
 if (isLongPath(path)) {
 MessageDigest digest;
 try {
@@ -267,6 +268,7 @@ public class Utils {
 String name = PathUtils.getName(path);
 return depth + ":h" + Hex.encodeHexString(hash) + "/" + name;
 }
+*/
 int depth = Utils.pathDepth(path);
 return depth + ":" + path;
 }
diff --git a/oak-core/src/test/java/org/apache/jackrabbit/oak/plugins/document/util/UtilsTest.java b/oak-core/src/test/java/org/apache/jackrabbit/oak/plugins/document/util/UtilsTest.java
index b9f619e..5ed1a95 100644
--- a/oak-core/src/test/java/org/apache/jackrabbit/oak/plugins/document/util/UtilsTest.java
+++ b/oak-core/src/test/java/org/apache/jackrabbit/oak/plugins/document/util/UtilsTest.java
@@ -32,6 +32,7 @@ import org.junit.Test;

 import static org.junit.Assert.assertEquals;
 import static org.junit.Assert.assertNull;
+import static org.junit.Assert.assertNotNull;
 import static org.junit.Assert.assertSame;
 import static org.junit.Assert.assertTrue;

@@ -62,7 +63,8 @@ public class UtilsTest {
 String longPath = PathUtils.concat("/" + Strings.repeat("p", Utils.PATH_LONG + 1), "foo");
 assertTrue(Utils.isLongPath(longPath));

-assertNull(Utils.getParentId(Utils.getIdFromPath(longPath)));
+// updated to match the changes to Utils.getIdFromPath
+assertNotNull(Utils.getParentId(Utils.getIdFromPath(longPath)));

 assertNull(Utils.getParentId(Utils.getIdFromPath("/")));
 assertEquals("1:/foo", Utils.getParentId("2:/foo/bar"));



Re: Modularization

2015-08-07 Thread Thomas Mueller
Hi,

I have nothing against modularization, I'm just against modularization in
the sense of creating many, many Maven projects. I prefer modularization
*within* one project. Why can't we do that instead?

Ideally you have a "root" project, e.g.

/oak
  /security
    /api
    /implementationA
    /implementationB
  /core
  /persistence
  /..

That looks like a Java *package* structure to me. The Wikipedia article
you mentioned is not about Maven projects, but about modularity in general.
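
For illustration only (the package names below are hypothetical), the same
layering could live inside a single Maven module as packages:

org.apache.jackrabbit.oak.security.api
org.apache.jackrabbit.oak.security.impl.a
org.apache.jackrabbit.oak.security.impl.b
org.apache.jackrabbit.oak.core
org.apache.jackrabbit.oak.persistence

Visibility can then be managed with package-private classes and, in an OSGi
bundle, with the list of exported packages, instead of separate artifacts.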

Regards,
Thomas



