[jira] [Commented] (OAK-4420) RepositorySidegrade: oak-segment to oak-segment-tar should migrate checkpoint info
[ https://issues.apache.org/jira/browse/OAK-4420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317233#comment-15317233 ]

Michael Dürig commented on OAK-4420:
------------------------------------

One way would be to rebase the diffs between the checkpoints of the source repository onto the checkpoints of the target repository. This would probably work best if the copy follows the order in which the checkpoints were initially created, rebasing the root last (on top of the most recent checkpoint).

> RepositorySidegrade: oak-segment to oak-segment-tar should migrate checkpoint info
> ----------------------------------------------------------------------------------
>
> Key: OAK-4420
> URL: https://issues.apache.org/jira/browse/OAK-4420
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: segment-tar, upgrade
> Reporter: Alex Parvulescu
> Assignee: Tomek Rękawek
> Attachments: OAK-4420-naive.patch
>
> The sidegrade from {{oak-segment}} to {{oak-segment-tar}} should also take
> care of moving the checkpoint data and metadata. This will save a very
> expensive full reindex.

-- 
This message was sent by Atlassian JIRA
(v6.3.4#6332)
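The rebase order suggested in the comment can be illustrated with a toy model (plain dicts stand in for node states; `diff` and `migrate_checkpoints` are hypothetical helpers for illustration, not the oak-upgrade API):

```python
def diff(old, new):
    """Changed/added entries between two flat states (toy model)."""
    return {k: v for k, v in new.items() if old.get(k) != v}

def migrate_checkpoints(src_checkpoints, src_root, target_head):
    """Replay source checkpoints onto the target in creation order
    (oldest first), rebasing the current root last, on top of the
    most recent checkpoint."""
    migrated = []
    base = dict(target_head)
    prev = {}
    for name, state in src_checkpoints:
        # Rebase the diff between consecutive checkpoints onto the target.
        base = {**base, **diff(prev, state)}
        migrated.append((name, dict(base)))
        prev = state
    # Finally rebase the source root on top of the newest checkpoint.
    new_head = {**base, **diff(prev, src_root)}
    return migrated, new_head
```

Rebasing in creation order means each checkpoint's diff is small, and the root lands on top of the state it most closely resembles.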
[jira] [Resolved] (OAK-4315) DefaultSyncHandler shouldn't apply automatic membership on existing users.
[ https://issues.apache.org/jira/browse/OAK-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angela resolved OAK-4315.
-------------------------
    Resolution: Won't Fix

[~baedke], feel free to reopen.

> DefaultSyncHandler shouldn't apply automatic membership on existing users.
> --------------------------------------------------------------------------
>
> Key: OAK-4315
> URL: https://issues.apache.org/jira/browse/OAK-4315
> Project: Jackrabbit Oak
> Issue Type: Wish
> Components: auth-external
> Affects Versions: 1.0.30, 1.2.14, 1.5.1
> Reporter: Manfred Baedke
> Assignee: Manfred Baedke
> Priority: Minor
>
> The DefaultSyncHandler applies automatic group membership on every user sync.
> It should only be applied when a new user has been created by the sync
> process.
[jira] [Commented] (OAK-4161) DefaultSyncHandler should avoid concurrent synchronization of the same user
[ https://issues.apache.org/jira/browse/OAK-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316661#comment-15316661 ]

angela commented on OAK-4161:
-----------------------------

[~baedke], maybe illustrating your report with a benchmark that highlights the issue would be a good thing to start with?

> DefaultSyncHandler should avoid concurrent synchronization of the same user
> ----------------------------------------------------------------------------
>
> Key: OAK-4161
> URL: https://issues.apache.org/jira/browse/OAK-4161
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: auth-external
> Reporter: Manfred Baedke
> Assignee: Manfred Baedke
>
> Concurrent synchronization of the same user may have a significant
> performance impact on systems where user sync is already a bottleneck.
[jira] [Updated] (OAK-4379) Batch mode for SyncMBeanImpl
[ https://issues.apache.org/jira/browse/OAK-4379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angela updated OAK-4379:
------------------------
    Fix Version/s: (was: 1.5.4)
                   1.5.3

> Batch mode for SyncMBeanImpl
> ----------------------------
>
> Key: OAK-4379
> URL: https://issues.apache.org/jira/browse/OAK-4379
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: auth-external
> Reporter: angela
> Assignee: angela
> Labels: performance
> Fix For: 1.5.3
>
> Attachments: OAK-4379.patch
>
> The {{SyncMBeanImpl}} currently calls {{Session.save()}} for every single
> sync, which IMO makes the synchronization methods extra expensive.
> IMHO we should consider introducing a batch mode that reduces the number of
> save calls. The drawback would be that the complete set of sync calls
> within a given batch would succeed or fail together. In case of failure,
> the 'original' sync result would need to be replaced by one with operation
> status 'ERR'.
> Now that we have the basis for running benchmarks for the {{SyncMBeanImpl}},
> we should be able to verify whether this proposal actually has a positive
> impact (though benchmark results from OAK-4119 and OAK-4120 seem to
> indicate that this is the case).
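The proposed batch mode could look roughly like this (an illustrative sketch, not the actual {{SyncMBeanImpl}} code; `sync_one` and `save` are hypothetical stand-ins for the per-user sync and {{Session.save()}}):

```python
def flush(batch, save):
    """Save one batch; on failure the whole batch is reported as 'ERR'."""
    try:
        save()
        return batch
    except Exception:
        # The complete set of sync calls in the batch succeeds or fails together.
        return ["ERR"] * len(batch)

def sync_users_batched(users, sync_one, save, batch_size=100):
    """Sync users, calling save() once per batch instead of once per user."""
    results, batch = [], []
    for user in users:
        batch.append(sync_one(user))
        if len(batch) == batch_size:
            results.extend(flush(batch, save))
            batch = []
    if batch:
        results.extend(flush(batch, save))
    return results
```

With `batch_size=1` this degenerates to the current save-per-sync behaviour, which makes the trade-off easy to benchmark.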
[jira] [Updated] (OAK-4399) Benchmark results for dynamic membership
[ https://issues.apache.org/jira/browse/OAK-4399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angela updated OAK-4399:
------------------------
    Fix Version/s: (was: 1.5.4)
                   1.5.3

> Benchmark results for dynamic membership
> ----------------------------------------
>
> Key: OAK-4399
> URL: https://issues.apache.org/jira/browse/OAK-4399
> Project: Jackrabbit Oak
> Issue Type: Task
> Components: auth-external
> Reporter: angela
> Assignee: angela
> Fix For: 1.5.3
>
> Attachments:
> ExternalAuthentication_ExternalLogin_dynamic_20160531_112853.csv,
> ExternalAuthentication_ExternalLogin_fullsync_20160531_115955.csv,
> ExternalAuthentication_SyncAllExternalUsersTest_dynamic_20160531_170205.csv,
> ExternalAuthentication_SyncAllExternalUsersTest_fullsync_20160531_122006.csv
>
> Task to document the results of benchmarks comparing the performance of the
> dynamic membership improvements.
[jira] [Updated] (OAK-4383) Benchmarks tests for oak-auth-external
[ https://issues.apache.org/jira/browse/OAK-4383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angela updated OAK-4383:
------------------------
    Fix Version/s: (was: 1.5.4)
                   1.5.3

> Benchmarks tests for oak-auth-external
> --------------------------------------
>
> Key: OAK-4383
> URL: https://issues.apache.org/jira/browse/OAK-4383
> Project: Jackrabbit Oak
> Issue Type: Epic
> Components: auth-external, run
> Reporter: angela
> Fix For: 1.5.3
[jira] [Updated] (OAK-4087) Replace Sync of configured AutoMembership by Dynamic Principal Generation
[ https://issues.apache.org/jira/browse/OAK-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angela updated OAK-4087:
------------------------
    Fix Version/s: (was: 1.5.4)
                   1.5.3

> Replace Sync of configured AutoMembership by Dynamic Principal Generation
> --------------------------------------------------------------------------
>
> Key: OAK-4087
> URL: https://issues.apache.org/jira/browse/OAK-4087
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: auth-external
> Reporter: angela
> Assignee: angela
> Labels: performance
> Fix For: 1.5.3
>
> Attachments: OAK-4087.patch, OAK-4087_documentation.patch
>
> The {{DefaultSyncConfig}} comes with a configuration option
> {{PARAM_USER_AUTO_MEMBERSHIP}} indicating the set of groups a given external
> user must always become a member of upon sync into the repository.
> This results in groups containing almost all users in the system (at least
> those synchronized from the external IDP). While this behavior is
> straightforward (and corresponds to the behavior in the previous crx
> version), it wouldn't be necessary from a repository point of view, as a
> given {{Subject}} can be populated from different principal sources, and
> dealing with this kind of dynamic auto-membership was a typical use case.
> What does that mean:
> Instead of performing the auto-membership in user management, the external
> authentication setup could come with an auto-membership
> {{PrincipalProvider}} implementation that would expose the desired group
> membership for all external principals (assuming they were identified as
> such).
> [~tripod], do you remember if that was ever an option while building the
> {{oak-auth-external}} module? If not, could it be worth a second thought,
> also in the light of OAK-3933?
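The dynamic auto-membership idea can be sketched with a toy model (hypothetical names, not the Oak {{PrincipalProvider}} API): membership in the configured groups is computed on the fly for external users instead of being written during sync.

```python
class AutoMembershipPrincipalProvider:
    """Expose configured auto-membership groups as dynamic principals
    for external users, instead of persisting group membership at sync
    time. 'is_external' is a predicate identifying IDP-synced users."""

    def __init__(self, auto_membership, is_external):
        self.auto_membership = set(auto_membership)
        self.is_external = is_external

    def get_group_membership(self, user_id):
        # No repository writes: membership is derived, not stored.
        return set(self.auto_membership) if self.is_external(user_id) else set()
```

The groups then no longer need member lists covering almost every synced user; the {{Subject}} is populated from this extra principal source at login.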
[jira] [Updated] (OAK-4419) Benchmark Results for SyncMBeanImpl with Batch Mode
[ https://issues.apache.org/jira/browse/OAK-4419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angela updated OAK-4419:
------------------------
    Fix Version/s: (was: 1.5.4)
                   1.5.3

> Benchmark Results for SyncMBeanImpl with Batch Mode
> ---------------------------------------------------
>
> Key: OAK-4419
> URL: https://issues.apache.org/jira/browse/OAK-4419
> Project: Jackrabbit Oak
> Issue Type: Technical task
> Components: auth-external
> Reporter: angela
> Assignee: angela
> Fix For: 1.5.3
>
> Attachments:
> ExternalAuthentication_SyncAllExternalUsers_batch1000_dynamic_20160602_083836.csv,
> ExternalAuthentication_SyncAllExternalUsers_batch1000_fullsync_20160602_090830.csv,
> ExternalAuthentication_SyncAllExternalUsers_batch100_dynamic_20160601_210703.csv,
> ExternalAuthentication_SyncAllExternalUsers_batch100_fullsync_20160601_171145.csv,
> ExternalAuthentication_SyncAllExternalUsers_batch10_dynamic_20160602_171000.csv,
> ExternalAuthentication_SyncAllExternalUsers_batch10_fullsync_20160602_145511.csv
>
> Verify the effect of OAK-4379 using the {{SyncAllExternalUsersTest}}
> benchmark as present in oak-run, and try to identify whether there exists an
> optimal batch size suited to be used as the default.
[jira] [Updated] (OAK-4430) DataStoreBlobStore#getAllChunkIds fetches DataRecord when not needed
[ https://issues.apache.org/jira/browse/OAK-4430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davide Giannella updated OAK-4430:
----------------------------------
    Fix Version/s: (was: 1.5.3)
                   1.6

> DataStoreBlobStore#getAllChunkIds fetches DataRecord when not needed
> ---------------------------------------------------------------------
>
> Key: OAK-4430
> URL: https://issues.apache.org/jira/browse/OAK-4430
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: blob
> Reporter: Amit Jain
> Assignee: Amit Jain
> Labels: candidate_oak_1_0, candidate_oak_1_2, candidate_oak_1_4
> Fix For: 1.6
>
> DataStoreBlobStore#getAllChunkIds loads the DataRecord to check that the
> lastModifiedTime criterion is satisfied against the given
> {{maxLastModifiedTime}}.
> When {{maxLastModifiedTime}} has the value 0, it effectively means: ignore
> any last-modified-time check (which is currently the only usage, from
> MarkSweepGarbageCollector). In that case fetching the DataRecords should be
> skipped, as it can be very expensive, e.g. on calls to S3 with millions of
> blobs.
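The proposed short-circuit could look like this (a sketch with hypothetical names, not the actual {{DataStoreBlobStore}} code; `get_record` stands in for the expensive DataRecord fetch):

```python
def get_all_chunk_ids(list_ids, get_record, max_last_modified=0):
    """Yield chunk ids; fetch the backing record (expensive, e.g. on S3)
    only when a last-modified cutoff actually has to be checked."""
    for chunk_id in list_ids():
        if max_last_modified == 0:
            # 0 means: no time filter, so skip the record fetch entirely.
            yield chunk_id
        elif get_record(chunk_id).last_modified <= max_last_modified:
            yield chunk_id
```

For millions of blobs this turns one remote fetch per id into none at all for the GC case.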
[jira] [Updated] (OAK-4432) Ignore files in the root directory of the FileDataStore in #getAllIdentifiers
[ https://issues.apache.org/jira/browse/OAK-4432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davide Giannella updated OAK-4432:
----------------------------------
    Fix Version/s: (was: 1.5.3)
                   1.6

> Ignore files in the root directory of the FileDataStore in #getAllIdentifiers
> -----------------------------------------------------------------------------
>
> Key: OAK-4432
> URL: https://issues.apache.org/jira/browse/OAK-4432
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: blob
> Reporter: Amit Jain
> Assignee: Amit Jain
> Priority: Minor
> Labels: candidate_oak_1_2, candidate_oak_1_4
> Fix For: 1.6
>
> The call to OakFileDataStore#getAllIdentifiers should ignore the files
> directly at the root of the DataStore (these files are used for
> SharedDataStore etc.). This does not cause any functional problems but leads
> to warnings in the logs.
> There is already a check, but it fails when the data store root is specified
> as a relative path.
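A sketch of the intended check (illustrative only, not the OakFileDataStore code): normalizing the configured root to an absolute path keeps the root-file filter working even when the store was configured with a relative path.

```python
import os

def get_all_identifiers(root, walk):
    """Yield blob identifiers, skipping files that sit directly in the
    data store root (SharedDataStore metadata). Comparing against the
    normalized root makes the check robust for relative root paths."""
    norm_root = os.path.abspath(root)
    for path in walk():
        if os.path.dirname(os.path.abspath(path)) == norm_root:
            continue  # metadata file in the root directory, not a blob
        yield os.path.basename(path)
```

Without the `abspath` normalization, a relative `root` would never string-match the parent directory of the walked files, which is the failure mode described above.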
[jira] [Updated] (OAK-4429) [oak-blob-cloud] S3Backend#getAllIdentifiers should not store all elements in memory
[ https://issues.apache.org/jira/browse/OAK-4429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davide Giannella updated OAK-4429:
----------------------------------
    Fix Version/s: (was: 1.5.3)
                   1.6

> [oak-blob-cloud] S3Backend#getAllIdentifiers should not store all elements in
> memory
> -----------------------------------------------------------------------------
>
> Key: OAK-4429
> URL: https://issues.apache.org/jira/browse/OAK-4429
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: blob
> Reporter: Amit Jain
> Assignee: Amit Jain
> Labels: candidate_oak_1_2, candidate_oak_1_4
> Fix For: 1.6
>
> While fetching all blob ids from S3, the data is stored in memory before an
> iterator over it is returned. This can be problematic when the number of
> blobs stored in S3 is in the millions.
> The code should be changed to store the elements in a temp file and then
> return an iterator over it.
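The temp-file spooling could be sketched as follows (illustrative, not the actual {{S3Backend}} code; `fetch_pages` is a hypothetical stand-in for paginated S3 listings):

```python
import tempfile

def get_all_identifiers(fetch_pages):
    """Spool blob ids to a temp file instead of collecting them in a list,
    then stream them back, so memory use stays flat regardless of how many
    millions of blobs the bucket holds."""
    spool = tempfile.TemporaryFile(mode="w+")
    for page in fetch_pages():            # e.g. one page per S3 listing call
        for blob_id in page:
            spool.write(blob_id + "\n")
    spool.seek(0)
    return (line.rstrip("\n") for line in spool)
```

The returned generator holds only the open file handle; the temp file is reclaimed by the OS once the generator is garbage-collected.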
[jira] [Created] (OAK-4433) Release Oak 1.5.3
Davide Giannella created OAK-4433:
----------------------------------

    Summary: Release Oak 1.5.3
    Key: OAK-4433
    URL: https://issues.apache.org/jira/browse/OAK-4433
    Project: Jackrabbit Oak
    Issue Type: Task
    Reporter: Davide Giannella
    Assignee: Davide Giannella
    Fix For: 1.6
[jira] [Resolved] (OAK-4379) Batch mode for SyncMBeanImpl
[ https://issues.apache.org/jira/browse/OAK-4379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angela resolved OAK-4379.
-------------------------
       Resolution: Fixed
    Fix Version/s: 1.5.4

> Batch mode for SyncMBeanImpl
> ----------------------------
>
> Key: OAK-4379
> URL: https://issues.apache.org/jira/browse/OAK-4379
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: auth-external
> Reporter: angela
> Assignee: angela
> Labels: performance
> Fix For: 1.5.4
>
> Attachments: OAK-4379.patch
>
> The {{SyncMBeanImpl}} currently calls {{Session.save()}} for every single
> sync, which IMO makes the synchronization methods extra expensive.
> IMHO we should consider introducing a batch mode that reduces the number of
> save calls. The drawback would be that the complete set of sync calls
> within a given batch would succeed or fail together. In case of failure,
> the 'original' sync result would need to be replaced by one with operation
> status 'ERR'.
> Now that we have the basis for running benchmarks for the {{SyncMBeanImpl}},
> we should be able to verify whether this proposal actually has a positive
> impact (though benchmark results from OAK-4119 and OAK-4120 seem to
> indicate that this is the case).
[jira] [Updated] (OAK-4426) RepositorySidegrade: oak-segment to oak-segment-tar should drop the name length check
[ https://issues.apache.org/jira/browse/OAK-4426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomek Rękawek updated OAK-4426:
-------------------------------
    Fix Version/s: (was: 1.5.4)
                   1.5.3

> RepositorySidegrade: oak-segment to oak-segment-tar should drop the name
> length check
> -------------------------------------------------------------------------
>
> Key: OAK-4426
> URL: https://issues.apache.org/jira/browse/OAK-4426
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: segment-tar, upgrade
> Reporter: Alex Parvulescu
> Assignee: Tomek Rękawek
> Fix For: 1.6, 1.5.3
>
> As mentioned on OAK-4260, this name length verification is causing some data
> to be dropped from the upgrade. This limitation only affects mongomk
> deployments, so it should not apply here.
> {code}
> *WARN* org.apache.jackrabbit.oak.upgrade.nodestate.NameFilteringNodeState -
> Node name 'node-name' too long.
> {code}
[jira] [Comment Edited] (OAK-4102) Break cyclic dependency of FileStore and SegmentTracker
[ https://issues.apache.org/jira/browse/OAK-4102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316418#comment-15316418 ]

Francesco Mari edited comment on OAK-4102 at 6/6/16 12:31 PM:
--------------------------------------------------------------

The work for OAK-4373 made it very easy to find a (potential) solution for this issue. With [^OAK-4102-01.patch] I propose the following changes:
# Introduce a {{SegmentIdMaker}} to create a new {{SegmentId}} given a msb/lsb pair.
# Remove the reference from {{SegmentIdTable}} to {{SegmentStore}}.
# In {{SegmentIdTable}} and {{SegmentTracker}}, every method that performs an insertion of a new {{SegmentId}} receives a reference to a {{SegmentIdMaker}}. The creation of the {{SegmentId}} is delegated to the {{SegmentIdMaker}} instance.
# {{SegmentStore}} can be used to look up an existing {{SegmentId}} or to create a new data/bulk {{SegmentId}}.

This change shelters users of a {{SegmentStore}} from the {{SegmentTracker}}, which is more and more an implementation detail of the {{SegmentStore}}. I ran every unit/integration test locally and didn't see any failure. The patch still needs some work, especially regarding documentation. [~mduerig], what do you think about this?

> Break cyclic dependency of FileStore and SegmentTracker
> --------------------------------------------------------
>
> Key: OAK-4102
> URL: https://issues.apache.org/jira/browse/OAK-4102
> Project: Jackrabbit Oak
> Issue Type: Technical task
> Components: segment-tar
> Reporter: Michael Dürig
> Assignee: Francesco Mari
> Labels: technical_debt
> Fix For: 1.6
>
> Attachments: OAK-4102-01.patch
>
> {{SegmentTracker}} and {{FileStore}} are mutually dependent on each other.
> This is problematic and makes initialising instances of these classes
> difficult: the {{FileStore}} constructor e.g. passes a not fully initialised
> instance to the {{SegmentTracker}}, which in turn writes an initial node
> state to the store. Notably using the not fully initialised {{FileStore}}
> instance!
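The delegation described in the patch can be sketched with a toy model (hypothetical names; in the patch the factory would be a {{SegmentIdMaker}} passed in by the {{SegmentStore}}, so the table no longer holds a store reference):

```python
class SegmentIdTable:
    """Toy sketch of the refactoring: the table no longer references the
    store. Callers pass a maker (factory) that knows how to build a
    segment id for a msb/lsb pair, breaking the cyclic dependency."""

    def __init__(self):
        self.ids = {}

    def get_or_create(self, msb, lsb, maker):
        key = (msb, lsb)
        if key not in self.ids:
            # Creation is delegated to the maker; lookup stays local.
            self.ids[key] = maker(msb, lsb)
        return self.ids[key]
```

Because the maker is supplied per call, the table can be constructed fully before any store exists, which is exactly the initialisation-order problem the issue describes.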
[jira] [Updated] (OAK-4102) Break cyclic dependency of FileStore and SegmentTracker
[ https://issues.apache.org/jira/browse/OAK-4102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francesco Mari updated OAK-4102:
--------------------------------
    Attachment: OAK-4102-01.patch

The work for OAK-4373 made it very easy to find a (potential) solution for this issue. With [^OAK-4102-01.patch] I propose the following changes:
# Introduce a {{SegmentIdMaker}} to create a new {{SegmentId}} given a msb/lsb pair.
# Remove the reference from {{SegmentIdTable}} to {{SegmentStore}}.
# In {{SegmentIdTable}} and {{SegmentTracker}}, every method that performs an insertion of a new {{SegmentId}} receives a reference to a {{SegmentIdMaker}}. The creation of the {{SegmentId}} is delegated to the {{SegmentIdMaker}} instance.
# {{SegmentStore}} can be used to look up an existing {{SegmentId}} or to create a new data/bulk {{SegmentId}}.

This change shelters users of a {{SegmentStore}} from the {{SegmentTracker}}, which is more and more an implementation detail of the {{SegmentStore}}. I ran every unit/integration test locally and didn't see any failure. The patch still needs some work, especially regarding documentation. [~mduerig], what do you think about this?

> Break cyclic dependency of FileStore and SegmentTracker
> --------------------------------------------------------
>
> Key: OAK-4102
> URL: https://issues.apache.org/jira/browse/OAK-4102
> Project: Jackrabbit Oak
> Issue Type: Technical task
> Components: segment-tar
> Reporter: Michael Dürig
> Assignee: Francesco Mari
> Labels: technical_debt
> Fix For: 1.6
>
> Attachments: OAK-4102-01.patch
>
> {{SegmentTracker}} and {{FileStore}} are mutually dependent on each other.
> This is problematic and makes initialising instances of these classes
> difficult: the {{FileStore}} constructor e.g. passes a not fully initialised
> instance to the {{SegmentTracker}}, which in turn writes an initial node
> state to the store. Notably using the not fully initialised {{FileStore}}
> instance!
[jira] [Updated] (OAK-3865) New strategy to optimize secondary reads
[ https://issues.apache.org/jira/browse/OAK-3865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomek Rękawek updated OAK-3865:
-------------------------------
    Description:

*Introduction*

In the current trunk we'll only read document _D_ from the secondary instance if:
(1) we have the parent _P_ of document _D_ cached and
(2) the parent hasn't been modified in 6 hours.

OAK-2106 tried to optimise (2) by estimating the lag using MongoDB replica stats. That was unreliable, so the second approach was to read the last revisions directly from each Mongo instance. If the modification date of _P_ is before the last revisions on all secondary Mongos, then the secondary can be used. The main problem with this approach is that we still need _P_ to be in the cache.

I think we need another way to optimise secondary reading, as right now only about 3% of requests connect to the secondary, which is bad especially for the global-clustering case (Mongo and Oak instances across the globe). The optimisation provided in OAK-2106 doesn't make things much better and may introduce some consistency issues.

*Proposal - tldr version*

Oak will remember the last revision it has ever seen. At the same time, it'll query each secondary Mongo instance, asking what the available stored root revision is. If all secondary instances have a root revision >= the last revision seen by a given Oak instance, it's safe to use the secondary read preference.

*Proposal*

I had the following constraints in mind preparing this:
1. Let's assume we have a sequence of commits with revisions _R1_, _R2_ and _R3_ modifying nodes _N1_, _N2_ and _N3_. If we already read _N1_ from revision _R2_, then reading from a secondary shouldn't result in getting an older revision (e.g. _R1_).
2. If an Oak instance modifies a document, then reading from a secondary shouldn't result in getting the old version (before the modification).

So, let's have two maps:
* _M1_: the most recent document revision read from Mongo for each cluster id,
* _M2_: the oldest lastRev value for the root document for each cluster id, read from all the secondary instances.

Maintaining _M1_: for every read from Mongo we'll check if the lastRev for some cluster id is newer than the _M1_ entry. If so, we'll update _M1_. For all writes we'll add the saved revision id with the current cluster id to _M1_.

Maintaining _M2_: it should be updated periodically. Such a mechanism is already prepared in the OAK-2106 patch.

The method deciding whether we can read from the secondary instance should compare the two maps. If all entries in _M2_ are newer than _M1_, the secondary instances contain a repository state at least as new as the one we have already accessed, and therefore it's safe to read from a secondary.

Regarding documents modified by the local Oak instance, we should remember all locally-modified paths and their revisions and use the primary Mongo to access them as long as the changes are not replicated to all the secondaries. When the secondaries are up to date with the modification, we can remove them from the local-changes collection.

Attached image diagram.png presents the idea.
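The map comparison at the heart of the proposal can be sketched as follows (a toy model: revisions are reduced to comparable numbers, keyed by cluster id):

```python
def can_read_from_secondary(m1, m2):
    """M1: newest revision this instance has seen, per cluster id.
    M2: oldest root lastRev per cluster id across all secondaries.
    Secondary reads are safe only when every secondary has caught up
    with everything this instance has already observed."""
    return all(m2.get(cluster_id, -1) >= rev for cluster_id, rev in m1.items())
```

A cluster id present in M1 but missing from M2 must count as "not caught up", since we cannot prove the secondaries have replicated that instance's writes.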
[jira] [Updated] (OAK-3865) New strategy to optimize secondary reads
[ https://issues.apache.org/jira/browse/OAK-3865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomek Rękawek updated OAK-3865:
-------------------------------
    Attachment: (was: OAK-3865.patch)

> New strategy to optimize secondary reads
> ----------------------------------------
>
> Key: OAK-3865
> URL: https://issues.apache.org/jira/browse/OAK-3865
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: mongomk
> Reporter: Tomek Rękawek
> Labels: performance
> Fix For: 1.6
>
> Attachments: OAK-3865.patch, diagram.png
>
> *Introduction*
> In the current trunk we'll only read document _D_ from the secondary instance
> if:
> (1) we have the parent _P_ of document _D_ cached and
> (2) the parent hasn't been modified in 6 hours.
> OAK-2106 tried to optimise (2) by estimating the lag using MongoDB replica
> stats. That was unreliable, so the second approach was to read the last
> revisions directly from each Mongo instance. If the modification date of _P_
> is before the last revisions on all secondary Mongos, then the secondary can
> be used. The main problem with this approach is that we still need _P_ to be
> in the cache. I think we need another way to optimise secondary reading, as
> right now only about 3% of requests connect to the secondary, which is bad
> especially for the global-clustering case (Mongo and Oak instances across
> the globe). The optimisation provided in OAK-2106 doesn't make things much
> better and may introduce some consistency issues.
> *Proposal*
> I had the following constraints in mind preparing this:
> 1. Let's assume we have a sequence of commits with revisions _R1_, _R2_ and
> _R3_ modifying nodes _N1_, _N2_ and _N3_. If we already read _N1_ from
> revision _R2_, then reading from a secondary shouldn't result in getting an
> older revision (e.g. _R1_).
> 2. If an Oak instance modifies a document, then reading from a secondary
> shouldn't result in getting the old version (before the modification).
> So, let's have two maps:
> * _M1_: the most recent document revision read from Mongo for each cluster
> id,
> * _M2_: the oldest lastRev value for the root document for each cluster id,
> read from all the secondary instances.
> Maintaining _M1_: for every read from Mongo we'll check if the lastRev for
> some cluster id is newer than the _M1_ entry. If so, we'll update _M1_. For
> all writes we'll add the saved revision id with the current cluster id to
> _M1_.
> Maintaining _M2_: it should be updated periodically. Such a mechanism is
> already prepared in the OAK-2106 patch.
> The method deciding whether we can read from the secondary instance should
> compare the two maps. If all entries in _M2_ are newer than _M1_, the
> secondary instances contain a repository state at least as new as the one we
> have already accessed, and therefore it's safe to read from a secondary.
> Regarding documents modified by the local Oak instance, we should remember
> all locally-modified paths and their revisions and use the primary Mongo to
> access them as long as the changes are not replicated to all the
> secondaries. When the secondaries are up to date with the modification, we
> can remove them from the local-changes collection.
> Attached image diagram.png presents the idea.
[jira] [Updated] (OAK-3865) New strategy to optimize secondary reads
[ https://issues.apache.org/jira/browse/OAK-3865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomek Rękawek updated OAK-3865:
-------------------------------
    Attachment: OAK-3865.patch

> New strategy to optimize secondary reads
> ----------------------------------------
>
> Key: OAK-3865
> URL: https://issues.apache.org/jira/browse/OAK-3865
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: mongomk
> Reporter: Tomek Rękawek
> Labels: performance
> Fix For: 1.6
>
> Attachments: OAK-3865.patch, diagram.png
>
> *Introduction*
> In the current trunk we'll only read document _D_ from the secondary instance
> if:
> (1) we have the parent _P_ of document _D_ cached and
> (2) the parent hasn't been modified in 6 hours.
> OAK-2106 tried to optimise (2) by estimating the lag using MongoDB replica
> stats. That was unreliable, so the second approach was to read the last
> revisions directly from each Mongo instance. If the modification date of _P_
> is before the last revisions on all secondary Mongos, then the secondary can
> be used. The main problem with this approach is that we still need _P_ to be
> in the cache. I think we need another way to optimise secondary reading, as
> right now only about 3% of requests connect to the secondary, which is bad
> especially for the global-clustering case (Mongo and Oak instances across
> the globe). The optimisation provided in OAK-2106 doesn't make things much
> better and may introduce some consistency issues.
> *Proposal*
> I had the following constraints in mind preparing this:
> 1. Let's assume we have a sequence of commits with revisions _R1_, _R2_ and
> _R3_ modifying nodes _N1_, _N2_ and _N3_. If we already read _N1_ from
> revision _R2_, then reading from a secondary shouldn't result in getting an
> older revision (e.g. _R1_).
> 2. If an Oak instance modifies a document, then reading from a secondary
> shouldn't result in getting the old version (before the modification).
> So, let's have two maps:
> * _M1_: the most recent document revision read from Mongo for each cluster
> id,
> * _M2_: the oldest lastRev value for the root document for each cluster id,
> read from all the secondary instances.
> Maintaining _M1_: for every read from Mongo we'll check if the lastRev for
> some cluster id is newer than the _M1_ entry. If so, we'll update _M1_. For
> all writes we'll add the saved revision id with the current cluster id to
> _M1_.
> Maintaining _M2_: it should be updated periodically. Such a mechanism is
> already prepared in the OAK-2106 patch.
> The method deciding whether we can read from the secondary instance should
> compare the two maps. If all entries in _M2_ are newer than _M1_, the
> secondary instances contain a repository state at least as new as the one we
> have already accessed, and therefore it's safe to read from a secondary.
> Regarding documents modified by the local Oak instance, we should remember
> all locally-modified paths and their revisions and use the primary Mongo to
> access them as long as the changes are not replicated to all the
> secondaries. When the secondaries are up to date with the modification, we
> can remove them from the local-changes collection.
> Attached image diagram.png presents the idea.
[jira] [Commented] (OAK-4200) [BlobGC] Improve collection times of blobs available
[ https://issues.apache.org/jira/browse/OAK-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316407#comment-15316407 ]

Amit Jain commented on OAK-4200:
--------------------------------

[~tmueller], [~chetanm] Could you please review the changes done at https://github.com/amit-jain/jackrabbit-oak/commit/a0a743467629b6695d1ed1616cf6fc85e2f6610b

> [BlobGC] Improve collection times of blobs available
> -----------------------------------------------------
>
> Key: OAK-4200
> URL: https://issues.apache.org/jira/browse/OAK-4200
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Reporter: Amit Jain
> Assignee: Amit Jain
> Fix For: 1.5.4
>
> The blob collection phase (identifying all the blobs available in the data
> store) is quite an expensive part of the whole GC process, sometimes taking
> up a few hours on large repositories, due to iteration of the sub-folders in
> the data store.
> In an offline discussion with [~tmueller] and [~chetanm], the idea came up
> that this phase can be faster if:
> * Blob ids are tracked when the blobs are added, e.g. in a simple file in
> the data store per cluster node.
> * GC then consolidates this file from all the cluster nodes and uses it to
> get the candidates for GC.
> * This variant of the MarkSweepGC can be triggered more frequently. It would
> be ok to miss blob id additions to this file during a crash etc., as these
> blobs can be cleaned up in the *regular* MarkSweepGC cycles triggered
> occasionally.
> We may also be able to track other metadata along with the blob ids, like
> paths, timestamps etc., for auditing/analytics, in conjunction with
> OAK-3140.
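The consolidation step from the bullet list above could be sketched as follows (illustrative only; `per_node_files` is a hypothetical stand-in for the per-cluster-node tracking files):

```python
def consolidate_tracked_ids(per_node_files):
    """Merge the per-cluster-node blob id tracking files into one candidate
    set for the GC collection phase, instead of iterating the sub-folders
    of the whole data store."""
    candidates = set()
    for lines in per_node_files:
        # Each tracking file holds one blob id per line.
        candidates.update(line.strip() for line in lines if line.strip())
    return candidates
```

Missing an id here only delays its deletion until a regular MarkSweepGC cycle, which matches the crash tolerance described above.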
[jira] [Commented] (OAK-4412) Lucene-memory property index
[ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316406#comment-15316406 ] Tomek Rękawek commented on OAK-4412: [~chetanm], thanks for the feedback. I'd be happier with using only the observer as well. My main concern is that the observer is informed about the changes asynchronously, so it may happen that the user commit()s their changes and runs the JCR query() before the observer event is processed. Isn't this possible or likely? Also, I've followed your advice about the indentation. Thanks, the patch is now much smaller. > Lucene-memory property index > > > Key: OAK-4412 > URL: https://issues.apache.org/jira/browse/OAK-4412 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: lucene >Reporter: Tomek Rękawek >Assignee: Tomek Rękawek > Fix For: 1.6 > > Attachments: OAK-4412.patch > > > When running Oak in a cluster, each write operation is expensive. After > performing some stress-tests with a geo-distributed Mongo cluster, we've > found out that updating property indexes is a large part of the overall > traffic. > The asynchronous index would be an answer here (as the index update won't be > made in the client request thread), but AEM requires the updates to be > visible immediately in order to work properly. > The idea here is to enhance the existing asynchronous Lucene index with a > synchronous, locally-stored counterpart that will persist only the data since > the last Lucene background reindexing job. > The new index can be stored in memory or (if necessary) in MMAPed local > files. Once the "main" Lucene index has been updated, the local index will be > purged. > Queries will use a union of results from the {{lucene}} and > {{lucene-memory}} indexes. > The {{lucene-memory}} index, as a locally stored entity, will be updated using > an observer, so it'll get both local and remote changes. 
> The original idea has been suggested by [~chetanm] in the discussion for > OAK-4233.
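The query-time union described in the issue could look roughly like this; {{UnionCursor}} and its method are hypothetical names, and the real oak-lucene cursor machinery is more involved:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: merge hits from the persisted lucene index and the
// in-memory synchronous index, deduplicating by path.
public class UnionCursor {

    public static List<String> union(List<String> luceneHits,
                                     List<String> memoryHits) {
        // LinkedHashSet keeps first-seen order while dropping duplicate paths
        Set<String> merged = new LinkedHashSet<>(luceneHits);
        merged.addAll(memoryHits);
        return new ArrayList<>(merged);
    }
}
```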
[jira] [Updated] (OAK-4412) Lucene-memory property index
[ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomek Rękawek updated OAK-4412: --- Attachment: (was: OAK-4412.patch) > Lucene-memory property index > > > Key: OAK-4412 > URL: https://issues.apache.org/jira/browse/OAK-4412 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: lucene >Reporter: Tomek Rękawek >Assignee: Tomek Rękawek > Fix For: 1.6 > > Attachments: OAK-4412.patch > > > When running Oak in a cluster, each write operation is expensive. After > performing some stress-tests with a geo-distributed Mongo cluster, we've > found out that updating property indexes is a large part of the overall > traffic. > The asynchronous index would be an answer here (as the index update won't be > made in the client request thread), but AEM requires the updates to be > visible immediately in order to work properly. > The idea here is to enhance the existing asynchronous Lucene index with a > synchronous, locally-stored counterpart that will persist only the data since > the last Lucene background reindexing job. > The new index can be stored in memory or (if necessary) in MMAPed local > files. Once the "main" Lucene index has been updated, the local index will be > purged. > Queries will use a union of results from the {{lucene}} and > {{lucene-memory}} indexes. > The {{lucene-memory}} index, as a locally stored entity, will be updated using > an observer, so it'll get both local and remote changes. > The original idea has been suggested by [~chetanm] in the discussion for > OAK-4233.
[jira] [Updated] (OAK-4412) Lucene-memory property index
[ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomek Rękawek updated OAK-4412: --- Attachment: OAK-4412.patch > Lucene-memory property index > > > Key: OAK-4412 > URL: https://issues.apache.org/jira/browse/OAK-4412 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: lucene >Reporter: Tomek Rękawek >Assignee: Tomek Rękawek > Fix For: 1.6 > > Attachments: OAK-4412.patch, OAK-4412.patch > > > When running Oak in a cluster, each write operation is expensive. After > performing some stress-tests with a geo-distributed Mongo cluster, we've > found out that updating property indexes is a large part of the overall > traffic. > The asynchronous index would be an answer here (as the index update won't be > made in the client request thread), but AEM requires the updates to be > visible immediately in order to work properly. > The idea here is to enhance the existing asynchronous Lucene index with a > synchronous, locally-stored counterpart that will persist only the data since > the last Lucene background reindexing job. > The new index can be stored in memory or (if necessary) in MMAPed local > files. Once the "main" Lucene index has been updated, the local index will be > purged. > Queries will use a union of results from the {{lucene}} and > {{lucene-memory}} indexes. > The {{lucene-memory}} index, as a locally stored entity, will be updated using > an observer, so it'll get both local and remote changes. > The original idea has been suggested by [~chetanm] in the discussion for > OAK-4233.
[jira] [Commented] (OAK-4412) Lucene-memory property index
[ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316394#comment-15316394 ] Vikas Saurabh commented on OAK-4412: Just to clarify - which problem are we trying to solve: # sometimes, due to delays in async indexing and background reads, subsequent requests from the same user get missing (not yet indexed / not yet background-read) results? # some code patterns which currently _expect_ the synchronous nature of property indexes (do change -> save -> query -> expect the earlier save to show up) won't cope well with the async nature? I understand the former problem statement has value and is worth solving. But my reading of this issue felt like we are trying to address the latter (the code expectation requires sync nature BUT property indices are currently expensive). I think, due to the different expectations (wrt timings - the diff between 2 user requests v/s the diff between 2 actions in the same call stack), we might want to discuss the 2 problems separately. It'd be great if the same solution solves both cases - but I think we shouldn't force the same solution on both. > Lucene-memory property index > > > Key: OAK-4412 > URL: https://issues.apache.org/jira/browse/OAK-4412 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: lucene >Reporter: Tomek Rękawek >Assignee: Tomek Rękawek > Fix For: 1.6 > > Attachments: OAK-4412.patch > > > When running Oak in a cluster, each write operation is expensive. After > performing some stress-tests with a geo-distributed Mongo cluster, we've > found out that updating property indexes is a large part of the overall > traffic. > The asynchronous index would be an answer here (as the index update won't be > made in the client request thread), but AEM requires the updates to be > visible immediately in order to work properly. 
> The idea here is to enhance the existing asynchronous Lucene index with a > synchronous, locally-stored counterpart that will persist only the data since > the last Lucene background reindexing job. > The new index can be stored in memory or (if necessary) in MMAPed local > files. Once the "main" Lucene index has been updated, the local index will be > purged. > Queries will use a union of results from the {{lucene}} and > {{lucene-memory}} indexes. > The {{lucene-memory}} index, as a locally stored entity, will be updated using > an observer, so it'll get both local and remote changes. > The original idea has been suggested by [~chetanm] in the discussion for > OAK-4233.
[jira] [Commented] (OAK-4430) DataStoreBlobStore#getAllChunkIds fetches DataRecord when not needed
[ https://issues.apache.org/jira/browse/OAK-4430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316324#comment-15316324 ] Amit Jain commented on OAK-4430: The method {{DataStoreBlobStore#getAllChunkIds}} also uses the fetched DataRecord to encode the length in the id. Considering that this method has only one consumer, i.e. the {{MarkSweepGarbageCollector}}, we could alter the method itself to not encode the blob ids with the length, and state this clearly in the javadocs. Alternately, we could add an overloaded method that returns all raw blob ids. Either way, this would require a method which the GC class can use to get a raw id from a length-encoded id, since the "node store referenced blobs" collection phase returns length-encoded ids. [~chetanm] wdyt? > DataStoreBlobStore#getAllChunkIds fetches DataRecord when not needed > > > Key: OAK-4430 > URL: https://issues.apache.org/jira/browse/OAK-4430 > Project: Jackrabbit Oak > Issue Type: Bug > Components: blob >Reporter: Amit Jain >Assignee: Amit Jain > Labels: candidate_oak_1_0, candidate_oak_1_2, candidate_oak_1_4 > Fix For: 1.5.3 > > > DataStoreBlobStore#getAllChunkIds loads the DataRecord to check that the > lastModifiedTime criterion is satisfied against the given > {{maxLastModifiedTime}}. > When {{maxLastModifiedTime}} has the value 0, it effectively means: ignore > any last modified time check (which is currently the only usage, from > MarkSweepGarbageCollector). In this case fetching the DataRecords should be skipped, as > it can be very expensive, e.g. on calls to S3 with millions of blobs.
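The raw-id helper discussed here could be as simple as the following sketch, assuming an "id#length" suffix encoding; the class and method names are hypothetical, not an actual Oak API:

```java
// Sketch: given a length-encoded id as returned by the "node store
// referenced blobs" collection phase, recover the raw data store id by
// stripping the assumed "#<length>" suffix.
public class BlobIdUtils {

    public static String stripLength(String encodedId) {
        int idx = encodedId.lastIndexOf('#');
        // ids without a length suffix are returned unchanged
        return idx == -1 ? encodedId : encodedId.substring(0, idx);
    }
}
```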
[jira] [Commented] (OAK-4422) support cluster for FileBlobStore
[ https://issues.apache.org/jira/browse/OAK-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316310#comment-15316310 ] Marco Piovesana commented on OAK-4422: -- Yes, we use TAR. We have a WebSphere cluster with a file system shared between all the nodes. In this file system there is the Oak node store. This is how we instantiate the repository: {code:title=RepositoryCreation|borderStyle=solid} BlobStore blobStore = new FileBlobStore(dataStoreFile.getAbsolutePath()); FileStore repositoryStore = FileStore.newFileStore(repositoryFile).withBlobStore(blobStore).create(); NodeStore nodeStore = SegmentNodeStore.newSegmentNodeStore(repositoryStore).create(); Jcr jcr = new Jcr(nodeStore).with(new InitialContent()).with(new SecurityProviderImpl()); Repository repository = jcr.createRepository(); {code} > support cluster for FileBlobStore > - > > Key: OAK-4422 > URL: https://issues.apache.org/jira/browse/OAK-4422 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: blob >Affects Versions: 1.4.3 >Reporter: Marco Piovesana > > I'm using Oak in a system where the user can store arbitrarily large binary > files, and because of that I thought the best option was to use the > FileBlobStore as the storage subsystem. > Now I need to port this solution to a cluster environment, but I saw that > clustering is supported only for Mongo and RDBMS storage systems. Is there > any plan to support it also for blob storage? Is there a better option?
[jira] [Updated] (OAK-4432) Ignore files in the root directory of the FileDataStore in #getAllIdentifiers
[ https://issues.apache.org/jira/browse/OAK-4432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Jain updated OAK-4432: --- Summary: Ignore files in the root directory of the FileDataStore in #getAllIdentifiers (was: Ignore files in the root directory of the DataStore) > Ignore files in the root directory of the FileDataStore in #getAllIdentifiers > - > > Key: OAK-4432 > URL: https://issues.apache.org/jira/browse/OAK-4432 > Project: Jackrabbit Oak > Issue Type: Bug > Components: blob >Reporter: Amit Jain >Assignee: Amit Jain >Priority: Minor > Labels: candidate_oak_1_2, candidate_oak_1_4 > Fix For: 1.5.3 > > > The call to OakFileDataStore#getAllIdentifiers should ignore the files > directly at the root of the DataStore (these files are used for > SharedDataStore etc.). This does not cause any functional problems, but leads > to warnings being logged. > There is already a check, but it fails when the data store root is specified > as a relative path.
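The relative-path-safe check could be sketched like this (hypothetical helper, not the actual OakFileDataStore code); normalizing both sides to absolute paths keeps the comparison correct when the root is configured as a relative path:

```java
import java.nio.file.Path;

// Sketch: decide whether a file sits directly in the data store root and
// should therefore be skipped by #getAllIdentifiers.
public class RootFileFilter {

    public static boolean isInRoot(Path file, Path dataStoreRoot) {
        // normalize so "./repository/datastore" and an absolute path to the
        // same directory compare equal
        Path parent = file.toAbsolutePath().normalize().getParent();
        return parent != null
                && parent.equals(dataStoreRoot.toAbsolutePath().normalize());
    }
}
```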
[jira] [Created] (OAK-4432) Ignore files in the root directory of the DataStore
Amit Jain created OAK-4432: -- Summary: Ignore files in the root directory of the DataStore Key: OAK-4432 URL: https://issues.apache.org/jira/browse/OAK-4432 Project: Jackrabbit Oak Issue Type: Bug Components: blob Reporter: Amit Jain Assignee: Amit Jain Priority: Minor Fix For: 1.5.3 The call to OakFileDataStore#getAllIdentifiers should ignore the files directly at the root of the DataStore (these files are used for SharedDataStore etc.). This does not cause any functional problems, but leads to warnings being logged. There is already a check, but it fails when the data store root is specified as a relative path.