[jira] [Commented] (OAK-4420) RepositorySidegrade: oak-segment to oak-segment-tar should migrate checkpoint info
[ https://issues.apache.org/jira/browse/OAK-4420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15317233#comment-15317233 ]

Michael Dürig commented on OAK-4420:
------------------------------------

One way would be to rebase the diffs between the checkpoints of the source repository onto the checkpoints of the target repository. This would probably work best if the copy follows the order in which the checkpoints were initially created, rebasing the root last (on top of the most recent checkpoint).

> RepositorySidegrade: oak-segment to oak-segment-tar should migrate checkpoint info
> ----------------------------------------------------------------------------------
>
> Key: OAK-4420
> URL: https://issues.apache.org/jira/browse/OAK-4420
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: segment-tar, upgrade
> Reporter: Alex Parvulescu
> Assignee: Tomek Rękawek
> Attachments: OAK-4420-naive.patch
>
> The sidegrade from {{oak-segment}} to {{oak-segment-tar}} should also take
> care of moving the checkpoint data and metadata. This will save a very
> expensive full reindex.

-- 
This message was sent by Atlassian JIRA
(v6.3.4#6332)
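The rebase order suggested in the comment can be illustrated with a toy model (plain dicts stand in for node states; `diff` and `migrate_checkpoints` are hypothetical helpers for illustration, not the oak-upgrade API):

```python
def diff(old, new):
    """Changed/added entries between two flat states (toy model)."""
    return {k: v for k, v in new.items() if old.get(k) != v}

def migrate_checkpoints(src_checkpoints, src_root, target_head):
    """Replay source checkpoints onto the target in creation order
    (oldest first), rebasing the current root last, on top of the
    most recent checkpoint."""
    migrated = []
    base = dict(target_head)
    prev = {}
    for name, state in src_checkpoints:
        # Rebase the diff between consecutive checkpoints onto the target.
        base = {**base, **diff(prev, state)}
        migrated.append((name, dict(base)))
        prev = state
    # Finally rebase the source root on top of the newest checkpoint.
    new_head = {**base, **diff(prev, src_root)}
    return migrated, new_head
```

Rebasing in creation order means each checkpoint's diff is small, and the root lands on top of the state it most closely resembles.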
[jira] [Resolved] (OAK-4315) DefaultSyncHandler shouldn't apply automatic membership on existing users.
[ https://issues.apache.org/jira/browse/OAK-4315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angela resolved OAK-4315.
-------------------------
    Resolution: Won't Fix

[~baedke], feel free to reopen.

> DefaultSyncHandler shouldn't apply automatic membership on existing users.
> --------------------------------------------------------------------------
>
> Key: OAK-4315
> URL: https://issues.apache.org/jira/browse/OAK-4315
> Project: Jackrabbit Oak
> Issue Type: Wish
> Components: auth-external
> Affects Versions: 1.0.30, 1.2.14, 1.5.1
> Reporter: Manfred Baedke
> Assignee: Manfred Baedke
> Priority: Minor
>
> The DefaultSyncHandler applies automatic group membership on every user sync.
> It should only be applied when a new user has been created by the sync
> process.
[jira] [Commented] (OAK-4161) DefaultSyncHandler should avoid concurrent synchronization of the same user
[ https://issues.apache.org/jira/browse/OAK-4161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316661#comment-15316661 ]

angela commented on OAK-4161:
-----------------------------

[~baedke], maybe illustrating your report with a benchmark that highlights the issue would be a good thing to start with?

> DefaultSyncHandler should avoid concurrent synchronization of the same user
> ----------------------------------------------------------------------------
>
> Key: OAK-4161
> URL: https://issues.apache.org/jira/browse/OAK-4161
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: auth-external
> Reporter: Manfred Baedke
> Assignee: Manfred Baedke
>
> Concurrent synchronization of the same user may have a significant
> performance impact on systems where user sync is already a bottleneck.
[jira] [Updated] (OAK-4379) Batch mode for SyncMBeanImpl
[ https://issues.apache.org/jira/browse/OAK-4379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angela updated OAK-4379:
------------------------
    Fix Version/s: (was: 1.5.4)
                   1.5.3

> Batch mode for SyncMBeanImpl
> ----------------------------
>
> Key: OAK-4379
> URL: https://issues.apache.org/jira/browse/OAK-4379
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: auth-external
> Reporter: angela
> Assignee: angela
> Labels: performance
> Fix For: 1.5.3
>
> Attachments: OAK-4379.patch
>
> The {{SyncMBeanImpl}} currently calls {{Session.save()}} for every single
> sync, which IMO makes the synchronization methods extra expensive.
> IMHO we should consider introducing a batch mode that reduces the number of
> save calls. The drawback would be that the complete set of sync calls
> within a given batch would succeed or fail together. In case of failure,
> the 'original' sync result would need to be replaced by one with operation
> status 'ERR'.
> Now that we have the basis for running benchmarks for the {{SyncMBeanImpl}},
> we should be able to verify whether this proposal actually has a positive
> impact (though benchmark results from OAK-4119 and OAK-4120 seem to
> indicate that this is the case).
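The proposed batch mode could look roughly like this (an illustrative sketch, not the actual {{SyncMBeanImpl}} code; `sync_one` and `save` are hypothetical stand-ins for the per-user sync and {{Session.save()}}):

```python
def flush(batch, save):
    """Save one batch; on failure the whole batch is reported as 'ERR'."""
    try:
        save()
        return batch
    except Exception:
        # The complete set of sync calls in the batch succeeds or fails together.
        return ["ERR"] * len(batch)

def sync_users_batched(users, sync_one, save, batch_size=100):
    """Sync users, calling save() once per batch instead of once per user."""
    results, batch = [], []
    for user in users:
        batch.append(sync_one(user))
        if len(batch) == batch_size:
            results.extend(flush(batch, save))
            batch = []
    if batch:
        results.extend(flush(batch, save))
    return results
```

With `batch_size=1` this degenerates to the current save-per-sync behaviour, which makes the trade-off easy to benchmark.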
[jira] [Updated] (OAK-4399) Benchmark results for dynamic membership
[ https://issues.apache.org/jira/browse/OAK-4399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angela updated OAK-4399:
------------------------
    Fix Version/s: (was: 1.5.4)
                   1.5.3

> Benchmark results for dynamic membership
> ----------------------------------------
>
> Key: OAK-4399
> URL: https://issues.apache.org/jira/browse/OAK-4399
> Project: Jackrabbit Oak
> Issue Type: Task
> Components: auth-external
> Reporter: angela
> Assignee: angela
> Fix For: 1.5.3
>
> Attachments:
> ExternalAuthentication_ExternalLogin_dynamic_20160531_112853.csv,
> ExternalAuthentication_ExternalLogin_fullsync_20160531_115955.csv,
> ExternalAuthentication_SyncAllExternalUsersTest_dynamic_20160531_170205.csv,
> ExternalAuthentication_SyncAllExternalUsersTest_fullsync_20160531_122006.csv
>
> Task to document the results of benchmarks comparing the performance of the
> dynamic membership improvements.
[jira] [Updated] (OAK-4383) Benchmarks tests for oak-auth-external
[ https://issues.apache.org/jira/browse/OAK-4383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angela updated OAK-4383:
------------------------
    Fix Version/s: (was: 1.5.4)
                   1.5.3

> Benchmarks tests for oak-auth-external
> --------------------------------------
>
> Key: OAK-4383
> URL: https://issues.apache.org/jira/browse/OAK-4383
> Project: Jackrabbit Oak
> Issue Type: Epic
> Components: auth-external, run
> Reporter: angela
> Fix For: 1.5.3
[jira] [Updated] (OAK-4087) Replace Sync of configured AutoMembership by Dynamic Principal Generation
[ https://issues.apache.org/jira/browse/OAK-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angela updated OAK-4087:
------------------------
    Fix Version/s: (was: 1.5.4)
                   1.5.3

> Replace Sync of configured AutoMembership by Dynamic Principal Generation
> --------------------------------------------------------------------------
>
> Key: OAK-4087
> URL: https://issues.apache.org/jira/browse/OAK-4087
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: auth-external
> Reporter: angela
> Assignee: angela
> Labels: performance
> Fix For: 1.5.3
>
> Attachments: OAK-4087.patch, OAK-4087_documentation.patch
>
> The {{DefaultSyncConfig}} comes with a configuration option
> {{PARAM_USER_AUTO_MEMBERSHIP}} indicating the set of groups a given external
> user must always become a member of upon sync into the repository.
> This results in groups containing almost all users in the system (at least
> those synchronized from the external IDP). While this behavior is
> straightforward (and corresponds to the behavior in the previous crx
> version), it wouldn't be necessary from a repository point of view, as a
> given {{Subject}} can be populated from different principal sources, and
> dealing with this kind of dynamic auto-membership was a typical use case.
> What does that mean:
> Instead of performing the auto-membership in user management, the external
> authentication setup could come with an auto-membership
> {{PrincipalProvider}} implementation that would expose the desired group
> membership for all external principals (assuming they were identified as
> such).
> [~tripod], do you remember if that was ever an option while building the
> {{oak-auth-external}} module? If not, could it be worth a second thought,
> also in the light of OAK-3933?
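The dynamic auto-membership idea can be sketched with a toy model (hypothetical names, not the Oak {{PrincipalProvider}} API): membership in the configured groups is computed on the fly for external users instead of being written during sync.

```python
class AutoMembershipPrincipalProvider:
    """Expose configured auto-membership groups as dynamic principals
    for external users, instead of persisting group membership at sync
    time. 'is_external' is a predicate identifying IDP-synced users."""

    def __init__(self, auto_membership, is_external):
        self.auto_membership = set(auto_membership)
        self.is_external = is_external

    def get_group_membership(self, user_id):
        # No repository writes: membership is derived, not stored.
        return set(self.auto_membership) if self.is_external(user_id) else set()
```

The groups then no longer need member lists covering almost every synced user; the {{Subject}} is populated from this extra principal source at login.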
[jira] [Updated] (OAK-4419) Benchmark Results for SyncMBeanImpl with Batch Mode
[ https://issues.apache.org/jira/browse/OAK-4419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angela updated OAK-4419:
------------------------
    Fix Version/s: (was: 1.5.4)
                   1.5.3

> Benchmark Results for SyncMBeanImpl with Batch Mode
> ---------------------------------------------------
>
> Key: OAK-4419
> URL: https://issues.apache.org/jira/browse/OAK-4419
> Project: Jackrabbit Oak
> Issue Type: Technical task
> Components: auth-external
> Reporter: angela
> Assignee: angela
> Fix For: 1.5.3
>
> Attachments:
> ExternalAuthentication_SyncAllExternalUsers_batch1000_dynamic_20160602_083836.csv,
> ExternalAuthentication_SyncAllExternalUsers_batch1000_fullsync_20160602_090830.csv,
> ExternalAuthentication_SyncAllExternalUsers_batch100_dynamic_20160601_210703.csv,
> ExternalAuthentication_SyncAllExternalUsers_batch100_fullsync_20160601_171145.csv,
> ExternalAuthentication_SyncAllExternalUsers_batch10_dynamic_20160602_171000.csv,
> ExternalAuthentication_SyncAllExternalUsers_batch10_fullsync_20160602_145511.csv
>
> Verify the effect of OAK-4379 using the {{SyncAllExternalUsersTest}}
> benchmark as present in oak-run, and try to identify whether there exists an
> optimal batch size suited to be used as the default.
[jira] [Updated] (OAK-4430) DataStoreBlobStore#getAllChunkIds fetches DataRecord when not needed
[ https://issues.apache.org/jira/browse/OAK-4430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davide Giannella updated OAK-4430:
----------------------------------
    Fix Version/s: (was: 1.5.3)
                   1.6

> DataStoreBlobStore#getAllChunkIds fetches DataRecord when not needed
> ---------------------------------------------------------------------
>
> Key: OAK-4430
> URL: https://issues.apache.org/jira/browse/OAK-4430
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: blob
> Reporter: Amit Jain
> Assignee: Amit Jain
> Labels: candidate_oak_1_0, candidate_oak_1_2, candidate_oak_1_4
> Fix For: 1.6
>
> DataStoreBlobStore#getAllChunkIds loads the DataRecord to check that the
> lastModifiedTime criterion is satisfied against the given
> {{maxLastModifiedTime}}.
> When {{maxLastModifiedTime}} has the value 0, it effectively means: ignore
> any last-modified-time check (which is currently the only usage, from
> MarkSweepGarbageCollector). In that case fetching the DataRecords should be
> skipped, as it can be very expensive, e.g. on calls to S3 with millions of
> blobs.
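The proposed short-circuit could look like this (a sketch with hypothetical names, not the actual {{DataStoreBlobStore}} code; `get_record` stands in for the expensive DataRecord fetch):

```python
def get_all_chunk_ids(list_ids, get_record, max_last_modified=0):
    """Yield chunk ids; fetch the backing record (expensive, e.g. on S3)
    only when a last-modified cutoff actually has to be checked."""
    for chunk_id in list_ids():
        if max_last_modified == 0:
            # 0 means: no time filter, so skip the record fetch entirely.
            yield chunk_id
        elif get_record(chunk_id).last_modified <= max_last_modified:
            yield chunk_id
```

For millions of blobs this turns one remote fetch per id into none at all for the GC case.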
[jira] [Updated] (OAK-4432) Ignore files in the root directory of the FileDataStore in #getAllIdentifiers
[ https://issues.apache.org/jira/browse/OAK-4432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davide Giannella updated OAK-4432:
----------------------------------
    Fix Version/s: (was: 1.5.3)
                   1.6

> Ignore files in the root directory of the FileDataStore in #getAllIdentifiers
> -----------------------------------------------------------------------------
>
> Key: OAK-4432
> URL: https://issues.apache.org/jira/browse/OAK-4432
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: blob
> Reporter: Amit Jain
> Assignee: Amit Jain
> Priority: Minor
> Labels: candidate_oak_1_2, candidate_oak_1_4
> Fix For: 1.6
>
> The call to OakFileDataStore#getAllIdentifiers should ignore the files
> directly at the root of the DataStore (these files are used for
> SharedDataStore etc.). This does not cause any functional problems but leads
> to warnings in the logs.
> There is already a check, but it fails when the data store root is specified
> as a relative path.
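A sketch of the intended check (illustrative only, not the OakFileDataStore code): normalizing the configured root to an absolute path keeps the root-file filter working even when the store was configured with a relative path.

```python
import os

def get_all_identifiers(root, walk):
    """Yield blob identifiers, skipping files that sit directly in the
    data store root (SharedDataStore metadata). Comparing against the
    normalized root makes the check robust for relative root paths."""
    norm_root = os.path.abspath(root)
    for path in walk():
        if os.path.dirname(os.path.abspath(path)) == norm_root:
            continue  # metadata file in the root directory, not a blob
        yield os.path.basename(path)
```

Without the `abspath` normalization, a relative `root` would never string-match the parent directory of the walked files, which is the failure mode described above.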
[jira] [Updated] (OAK-4429) [oak-blob-cloud] S3Backend#getAllIdentifiers should not store all elements in memory
[ https://issues.apache.org/jira/browse/OAK-4429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davide Giannella updated OAK-4429:
----------------------------------
    Fix Version/s: (was: 1.5.3)
                   1.6

> [oak-blob-cloud] S3Backend#getAllIdentifiers should not store all elements in
> memory
> -----------------------------------------------------------------------------
>
> Key: OAK-4429
> URL: https://issues.apache.org/jira/browse/OAK-4429
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: blob
> Reporter: Amit Jain
> Assignee: Amit Jain
> Labels: candidate_oak_1_2, candidate_oak_1_4
> Fix For: 1.6
>
> While fetching all blob ids from S3, the data is stored in memory before an
> iterator over it is returned. This can be problematic when the number of
> blobs stored in S3 is in the millions.
> The code should be changed to store the elements in a temp file and then
> return an iterator over it.
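The temp-file spooling could be sketched as follows (illustrative, not the actual {{S3Backend}} code; `fetch_pages` is a hypothetical stand-in for paginated S3 listings):

```python
import tempfile

def get_all_identifiers(fetch_pages):
    """Spool blob ids to a temp file instead of collecting them in a list,
    then stream them back, so memory use stays flat regardless of how many
    millions of blobs the bucket holds."""
    spool = tempfile.TemporaryFile(mode="w+")
    for page in fetch_pages():            # e.g. one page per S3 listing call
        for blob_id in page:
            spool.write(blob_id + "\n")
    spool.seek(0)
    return (line.rstrip("\n") for line in spool)
```

The returned generator holds only the open file handle; the temp file is reclaimed by the OS once the generator is garbage-collected.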
[jira] [Created] (OAK-4433) Release Oak 1.5.3
Davide Giannella created OAK-4433:
----------------------------------

    Summary: Release Oak 1.5.3
    Key: OAK-4433
    URL: https://issues.apache.org/jira/browse/OAK-4433
    Project: Jackrabbit Oak
    Issue Type: Task
    Reporter: Davide Giannella
    Assignee: Davide Giannella
    Fix For: 1.6
[jira] [Resolved] (OAK-4379) Batch mode for SyncMBeanImpl
[ https://issues.apache.org/jira/browse/OAK-4379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angela resolved OAK-4379.
-------------------------
       Resolution: Fixed
    Fix Version/s: 1.5.4

> Batch mode for SyncMBeanImpl
> ----------------------------
>
> Key: OAK-4379
> URL: https://issues.apache.org/jira/browse/OAK-4379
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: auth-external
> Reporter: angela
> Assignee: angela
> Labels: performance
> Fix For: 1.5.4
>
> Attachments: OAK-4379.patch
>
> The {{SyncMBeanImpl}} currently calls {{Session.save()}} for every single
> sync, which IMO makes the synchronization methods extra expensive.
> IMHO we should consider introducing a batch mode that reduces the number of
> save calls. The drawback would be that the complete set of sync calls
> within a given batch would succeed or fail together. In case of failure,
> the 'original' sync result would need to be replaced by one with operation
> status 'ERR'.
> Now that we have the basis for running benchmarks for the {{SyncMBeanImpl}},
> we should be able to verify whether this proposal actually has a positive
> impact (though benchmark results from OAK-4119 and OAK-4120 seem to
> indicate that this is the case).
[jira] [Updated] (OAK-4426) RepositorySidegrade: oak-segment to oak-segment-tar should drop the name length check
[ https://issues.apache.org/jira/browse/OAK-4426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomek Rękawek updated OAK-4426:
-------------------------------
    Fix Version/s: (was: 1.5.4)
                   1.5.3

> RepositorySidegrade: oak-segment to oak-segment-tar should drop the name
> length check
> -------------------------------------------------------------------------
>
> Key: OAK-4426
> URL: https://issues.apache.org/jira/browse/OAK-4426
> Project: Jackrabbit Oak
> Issue Type: Bug
> Components: segment-tar, upgrade
> Reporter: Alex Parvulescu
> Assignee: Tomek Rękawek
> Fix For: 1.6, 1.5.3
>
> As mentioned on OAK-4260, this name length verification is causing some data
> to be dropped from the upgrade. This limitation only affects mongomk
> deployments, so it should not apply here.
> {code}
> *WARN* org.apache.jackrabbit.oak.upgrade.nodestate.NameFilteringNodeState -
> Node name 'node-name' too long.
> {code}
[jira] [Comment Edited] (OAK-4102) Break cyclic dependency of FileStore and SegmentTracker
[ https://issues.apache.org/jira/browse/OAK-4102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316418#comment-15316418 ]

Francesco Mari edited comment on OAK-4102 at 6/6/16 12:31 PM:
--------------------------------------------------------------

The work for OAK-4373 made it very easy to find a (potential) solution for this issue. With [^OAK-4102-01.patch] I propose the following changes:
# Introduce a {{SegmentIdMaker}} to create a new {{SegmentId}} given a msb/lsb pair.
# Remove the reference from {{SegmentIdTable}} to {{SegmentStore}}.
# In {{SegmentIdTable}} and {{SegmentTracker}}, every method that performs an insertion of a new {{SegmentId}} receives a reference to a {{SegmentIdMaker}}. The creation of the {{SegmentId}} is delegated to the {{SegmentIdMaker}} instance.
# {{SegmentStore}} can be used to look up an existing {{SegmentId}} or to create a new data/bulk {{SegmentId}}.

This change shelters users of a {{SegmentStore}} from the {{SegmentTracker}}, which is more and more an implementation detail of the {{SegmentStore}}. I ran every unit/integration test locally and didn't see any failure. The patch still needs some work, especially regarding documentation. [~mduerig], what do you think about this?

> Break cyclic dependency of FileStore and SegmentTracker
> --------------------------------------------------------
>
> Key: OAK-4102
> URL: https://issues.apache.org/jira/browse/OAK-4102
> Project: Jackrabbit Oak
> Issue Type: Technical task
> Components: segment-tar
> Reporter: Michael Dürig
> Assignee: Francesco Mari
> Labels: technical_debt
> Fix For: 1.6
>
> Attachments: OAK-4102-01.patch
>
> {{SegmentTracker}} and {{FileStore}} are mutually dependent on each other.
> This is problematic and makes initialising instances of these classes
> difficult: the {{FileStore}} constructor e.g. passes a not fully initialised
> instance to the {{SegmentTracker}}, which in turn writes an initial node
> state to the store. Notably using the not fully initialised {{FileStore}}
> instance!
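The delegation described in the patch can be sketched with a toy model (hypothetical names; in the patch the factory would be a {{SegmentIdMaker}} passed in by the {{SegmentStore}}, so the table no longer holds a store reference):

```python
class SegmentIdTable:
    """Toy sketch of the refactoring: the table no longer references the
    store. Callers pass a maker (factory) that knows how to build a
    segment id for a msb/lsb pair, breaking the cyclic dependency."""

    def __init__(self):
        self.ids = {}

    def get_or_create(self, msb, lsb, maker):
        key = (msb, lsb)
        if key not in self.ids:
            # Creation is delegated to the maker; lookup stays local.
            self.ids[key] = maker(msb, lsb)
        return self.ids[key]
```

Because the maker is supplied per call, the table can be constructed fully before any store exists, which is exactly the initialisation-order problem the issue describes.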
[jira] [Updated] (OAK-4102) Break cyclic dependency of FileStore and SegmentTracker
[ https://issues.apache.org/jira/browse/OAK-4102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francesco Mari updated OAK-4102:
--------------------------------
    Attachment: OAK-4102-01.patch

The work for OAK-4373 made it very easy to find a (potential) solution for this issue. With [^OAK-4102-01.patch] I propose the following changes:
# Introduce a {{SegmentIdMaker}} to create a new {{SegmentId}} given a msb/lsb pair.
# Remove the reference from {{SegmentIdTable}} to {{SegmentStore}}.
# In {{SegmentIdTable}} and {{SegmentTracker}}, every method that performs an insertion of a new {{SegmentId}} receives a reference to a {{SegmentIdMaker}}. The creation of the {{SegmentId}} is delegated to the {{SegmentIdMaker}} instance.
# {{SegmentStore}} can be used to look up an existing {{SegmentId}} or to create a new data/bulk {{SegmentId}}.

This change shelters users of a {{SegmentStore}} from the {{SegmentTracker}}, which is more and more an implementation detail of the {{SegmentStore}}. I ran every unit/integration test locally and didn't see any failure. The patch still needs some work, especially regarding documentation. [~mduerig], what do you think about this?

> Break cyclic dependency of FileStore and SegmentTracker
> --------------------------------------------------------
>
> Key: OAK-4102
> URL: https://issues.apache.org/jira/browse/OAK-4102
> Project: Jackrabbit Oak
> Issue Type: Technical task
> Components: segment-tar
> Reporter: Michael Dürig
> Assignee: Francesco Mari
> Labels: technical_debt
> Fix For: 1.6
>
> Attachments: OAK-4102-01.patch
>
> {{SegmentTracker}} and {{FileStore}} are mutually dependent on each other.
> This is problematic and makes initialising instances of these classes
> difficult: the {{FileStore}} constructor e.g. passes a not fully initialised
> instance to the {{SegmentTracker}}, which in turn writes an initial node
> state to the store. Notably using the not fully initialised {{FileStore}}
> instance!
[jira] [Updated] (OAK-3865) New strategy to optimize secondary reads
[ https://issues.apache.org/jira/browse/OAK-3865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomek Rękawek updated OAK-3865:
-------------------------------
    Description:

*Introduction*

In the current trunk we'll only read document _D_ from the secondary instance if:
(1) we have the parent _P_ of document _D_ cached and
(2) the parent hasn't been modified in 6 hours.

OAK-2106 tried to optimise (2) by estimating the lag using MongoDB replica stats. That was unreliable, so the second approach was to read the last revisions directly from each Mongo instance. If the modification date of _P_ is before the last revisions on all secondary Mongos, then the secondary can be used. The main problem with this approach is that we still need _P_ to be in the cache.

I think we need another way to optimise secondary reading, as right now only about 3% of requests connect to the secondary, which is bad especially for the global-clustering case (Mongo and Oak instances across the globe). The optimisation provided in OAK-2106 doesn't make things much better and may introduce some consistency issues.

*Proposal - tldr version*

Oak will remember the last revision it has ever seen. At the same time, it'll query each secondary Mongo instance, asking what the available stored root revision is. If all secondary instances have a root revision >= the last revision seen by a given Oak instance, it's safe to use the secondary read preference.

*Proposal*

I had the following constraints in mind preparing this:
1. Let's assume we have a sequence of commits with revisions _R1_, _R2_ and _R3_ modifying nodes _N1_, _N2_ and _N3_. If we already read _N1_ from revision _R2_, then reading from a secondary shouldn't result in getting an older revision (e.g. _R1_).
2. If an Oak instance modifies a document, then reading from a secondary shouldn't result in getting the old version (before the modification).

So, let's have two maps:
* _M1_: the most recent document revision read from Mongo for each cluster id,
* _M2_: the oldest lastRev value for the root document for each cluster id, read from all the secondary instances.

Maintaining _M1_: for every read from Mongo we'll check if the lastRev for some cluster id is newer than the _M1_ entry. If so, we'll update _M1_. For all writes we'll add the saved revision id with the current cluster id to _M1_.

Maintaining _M2_: it should be updated periodically. Such a mechanism is already prepared in the OAK-2106 patch.

The method deciding whether we can read from the secondary instance should compare the two maps. If all entries in _M2_ are newer than _M1_, the secondary instances contain a repository state at least as new as the one we have already accessed, and therefore it's safe to read from a secondary.

Regarding documents modified by the local Oak instance, we should remember all locally-modified paths and their revisions and use the primary Mongo to access them as long as the changes are not replicated to all the secondaries. When the secondaries are up to date with the modification, we can remove them from the local-changes collection.

Attached image diagram.png presents the idea.
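The map comparison at the heart of the proposal can be sketched as follows (a toy model: revisions are reduced to comparable numbers, keyed by cluster id):

```python
def can_read_from_secondary(m1, m2):
    """M1: newest revision this instance has seen, per cluster id.
    M2: oldest root lastRev per cluster id across all secondaries.
    Secondary reads are safe only when every secondary has caught up
    with everything this instance has already observed."""
    return all(m2.get(cluster_id, -1) >= rev for cluster_id, rev in m1.items())
```

A cluster id present in M1 but missing from M2 must count as "not caught up", since we cannot prove the secondaries have replicated that instance's writes.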
[jira] [Updated] (OAK-3865) New strategy to optimize secondary reads
[ https://issues.apache.org/jira/browse/OAK-3865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomek Rękawek updated OAK-3865:
-------------------------------
    Attachment: (was: OAK-3865.patch)

> New strategy to optimize secondary reads
> ----------------------------------------
>
> Key: OAK-3865
> URL: https://issues.apache.org/jira/browse/OAK-3865
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: mongomk
> Reporter: Tomek Rękawek
> Labels: performance
> Fix For: 1.6
>
> Attachments: OAK-3865.patch, diagram.png
>
> *Introduction*
> In the current trunk we'll only read document _D_ from the secondary instance
> if:
> (1) we have the parent _P_ of document _D_ cached and
> (2) the parent hasn't been modified in 6 hours.
> OAK-2106 tried to optimise (2) by estimating the lag using MongoDB replica
> stats. That was unreliable, so the second approach was to read the last
> revisions directly from each Mongo instance. If the modification date of _P_
> is before the last revisions on all secondary Mongos, then the secondary can
> be used. The main problem with this approach is that we still need _P_ to be
> in the cache. I think we need another way to optimise secondary reading, as
> right now only about 3% of requests connect to the secondary, which is bad
> especially for the global-clustering case (Mongo and Oak instances across
> the globe). The optimisation provided in OAK-2106 doesn't make things much
> better and may introduce some consistency issues.
> *Proposal*
> I had the following constraints in mind preparing this:
> 1. Let's assume we have a sequence of commits with revisions _R1_, _R2_ and
> _R3_ modifying nodes _N1_, _N2_ and _N3_. If we already read _N1_ from
> revision _R2_, then reading from a secondary shouldn't result in getting an
> older revision (e.g. _R1_).
> 2. If an Oak instance modifies a document, then reading from a secondary
> shouldn't result in getting the old version (before the modification).
> So, let's have two maps:
> * _M1_: the most recent document revision read from Mongo for each cluster
> id,
> * _M2_: the oldest lastRev value for the root document for each cluster id,
> read from all the secondary instances.
> Maintaining _M1_: for every read from Mongo we'll check if the lastRev for
> some cluster id is newer than the _M1_ entry. If so, we'll update _M1_. For
> all writes we'll add the saved revision id with the current cluster id to
> _M1_.
> Maintaining _M2_: it should be updated periodically. Such a mechanism is
> already prepared in the OAK-2106 patch.
> The method deciding whether we can read from the secondary instance should
> compare the two maps. If all entries in _M2_ are newer than _M1_, the
> secondary instances contain a repository state at least as new as the one we
> have already accessed, and therefore it's safe to read from a secondary.
> Regarding documents modified by the local Oak instance, we should remember
> all locally-modified paths and their revisions and use the primary Mongo to
> access them as long as the changes are not replicated to all the
> secondaries. When the secondaries are up to date with the modification, we
> can remove them from the local-changes collection.
> Attached image diagram.png presents the idea.
[jira] [Updated] (OAK-3865) New strategy to optimize secondary reads
[ https://issues.apache.org/jira/browse/OAK-3865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tomek Rękawek updated OAK-3865:
-------------------------------
    Attachment: OAK-3865.patch

> New strategy to optimize secondary reads
> ----------------------------------------
>
> Key: OAK-3865
> URL: https://issues.apache.org/jira/browse/OAK-3865
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: mongomk
> Reporter: Tomek Rękawek
> Labels: performance
> Fix For: 1.6
>
> Attachments: OAK-3865.patch, diagram.png
>
> *Introduction*
> In the current trunk we'll only read document _D_ from the secondary instance
> if:
> (1) we have the parent _P_ of document _D_ cached and
> (2) the parent hasn't been modified in 6 hours.
> OAK-2106 tried to optimise (2) by estimating the lag using MongoDB replica
> stats. That was unreliable, so the second approach was to read the last
> revisions directly from each Mongo instance. If the modification date of _P_
> is before the last revisions on all secondary Mongos, then the secondary can
> be used. The main problem with this approach is that we still need _P_ to be
> in the cache. I think we need another way to optimise secondary reading, as
> right now only about 3% of requests connect to the secondary, which is bad
> especially for the global-clustering case (Mongo and Oak instances across
> the globe). The optimisation provided in OAK-2106 doesn't make things much
> better and may introduce some consistency issues.
> *Proposal*
> I had the following constraints in mind preparing this:
> 1. Let's assume we have a sequence of commits with revisions _R1_, _R2_ and
> _R3_ modifying nodes _N1_, _N2_ and _N3_. If we already read _N1_ from
> revision _R2_, then reading from a secondary shouldn't result in getting an
> older revision (e.g. _R1_).
> 2. If an Oak instance modifies a document, then reading from a secondary
> shouldn't result in getting the old version (before the modification).
> So, let's have two maps:
> * _M1_: the most recent document revision read from Mongo for each cluster
> id,
> * _M2_: the oldest lastRev value for the root document for each cluster id,
> read from all the secondary instances.
> Maintaining _M1_: for every read from Mongo we'll check if the lastRev for
> some cluster id is newer than the _M1_ entry. If so, we'll update _M1_. For
> all writes we'll add the saved revision id with the current cluster id to
> _M1_.
> Maintaining _M2_: it should be updated periodically. Such a mechanism is
> already prepared in the OAK-2106 patch.
> The method deciding whether we can read from the secondary instance should
> compare the two maps. If all entries in _M2_ are newer than _M1_, the
> secondary instances contain a repository state at least as new as the one we
> have already accessed, and therefore it's safe to read from a secondary.
> Regarding documents modified by the local Oak instance, we should remember
> all locally-modified paths and their revisions and use the primary Mongo to
> access them as long as the changes are not replicated to all the
> secondaries. When the secondaries are up to date with the modification, we
> can remove them from the local-changes collection.
> Attached image diagram.png presents the idea.
[jira] [Commented] (OAK-4200) [BlobGC] Improve collection times of blobs available
[ https://issues.apache.org/jira/browse/OAK-4200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316407#comment-15316407 ]

Amit Jain commented on OAK-4200:
--------------------------------

[~tmueller], [~chetanm] Could you please review the changes done at https://github.com/amit-jain/jackrabbit-oak/commit/a0a743467629b6695d1ed1616cf6fc85e2f6610b

> [BlobGC] Improve collection times of blobs available
> -----------------------------------------------------
>
> Key: OAK-4200
> URL: https://issues.apache.org/jira/browse/OAK-4200
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Reporter: Amit Jain
> Assignee: Amit Jain
> Fix For: 1.5.4
>
> The blob collection phase (identifying all the blobs available in the data
> store) is quite an expensive part of the whole GC process, sometimes taking
> up a few hours on large repositories, due to iteration of the sub-folders in
> the data store.
> In an offline discussion with [~tmueller] and [~chetanm], the idea came up
> that this phase can be faster if:
> * Blob ids are tracked when the blobs are added, e.g. in a simple file in
> the data store per cluster node.
> * GC then consolidates this file from all the cluster nodes and uses it to
> get the candidates for GC.
> * This variant of the MarkSweepGC can be triggered more frequently. It would
> be ok to miss blob id additions to this file during a crash etc., as these
> blobs can be cleaned up in the *regular* MarkSweepGC cycles triggered
> occasionally.
> We may also be able to track other metadata along with the blob ids, like
> paths, timestamps etc., for auditing/analytics, in conjunction with
> OAK-3140.
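The consolidation step from the bullet list above could be sketched as follows (illustrative only; `per_node_files` is a hypothetical stand-in for the per-cluster-node tracking files):

```python
def consolidate_tracked_ids(per_node_files):
    """Merge the per-cluster-node blob id tracking files into one candidate
    set for the GC collection phase, instead of iterating the sub-folders
    of the whole data store."""
    candidates = set()
    for lines in per_node_files:
        # Each tracking file holds one blob id per line.
        candidates.update(line.strip() for line in lines if line.strip())
    return candidates
```

Missing an id here only delays its deletion until a regular MarkSweepGC cycle, which matches the crash tolerance described above.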
[jira] [Commented] (OAK-4412) Lucene-memory property index
[ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316406#comment-15316406 ] Tomek Rękawek commented on OAK-4412: [~chetanm], thanks for the feedback. I'd be happier with using only the observer as well. My main concern is that the observer is informed about the changes asynchronously, so it may happen that the user commit()s their changes and runs the JCR query() before the observer event is processed. Isn't this possible or likely? Also, I've followed your advice about the indentation. Thanks, the patch is now much smaller. > Lucene-memory property index > > > Key: OAK-4412 > URL: https://issues.apache.org/jira/browse/OAK-4412 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: lucene >Reporter: Tomek Rękawek >Assignee: Tomek Rękawek > Fix For: 1.6 > > Attachments: OAK-4412.patch > > > When running Oak in a cluster, each write operation is expensive. After > performing some stress-tests with a geo-distributed Mongo cluster, we've > found out that updating property indexes is a large part of the overall > traffic. > The asynchronous index would be an answer here (as the index update won't be > made in the client request thread), but AEM requires the updates to be > visible immediately in order to work properly. > The idea here is to enhance the existing asynchronous Lucene index with a > synchronous, locally-stored counterpart that will persist only the data since > the last Lucene background reindexing job. > The new index can be stored in memory or (if necessary) in MMAPed local > files. Once the "main" Lucene index has been updated, the local index will be > purged. > Queries will use a union of results from the {{lucene}} and > {{lucene-memory}} indexes. > The {{lucene-memory}} index, as a locally stored entity, will be updated using > an observer, so it'll get both local and remote changes. 
> The original idea has been suggested by [~chetanm] in the discussion for > OAK-4233.
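The query-time union described in the issue could look roughly like this; {{UnionCursor}} and its method are hypothetical names, and the real oak-lucene cursor machinery is more involved:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: merge hits from the persisted lucene index and the
// in-memory synchronous index, deduplicating by path.
public class UnionCursor {

    public static List<String> union(List<String> luceneHits,
                                     List<String> memoryHits) {
        // LinkedHashSet keeps first-seen order while dropping duplicate paths
        Set<String> merged = new LinkedHashSet<>(luceneHits);
        merged.addAll(memoryHits);
        return new ArrayList<>(merged);
    }
}
```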
[jira] [Updated] (OAK-4412) Lucene-memory property index
[ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomek Rękawek updated OAK-4412: --- Attachment: (was: OAK-4412.patch) > Lucene-memory property index > > > Key: OAK-4412 > URL: https://issues.apache.org/jira/browse/OAK-4412 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: lucene >Reporter: Tomek Rękawek >Assignee: Tomek Rękawek > Fix For: 1.6 > > Attachments: OAK-4412.patch > > > When running Oak in a cluster, each write operation is expensive. After > performing some stress-tests with a geo-distributed Mongo cluster, we've > found out that updating property indexes is a large part of the overall > traffic. > The asynchronous index would be an answer here (as the index update won't be > made in the client request thread), but AEM requires the updates to be > visible immediately in order to work properly. > The idea here is to enhance the existing asynchronous Lucene index with a > synchronous, locally-stored counterpart that will persist only the data since > the last Lucene background reindexing job. > The new index can be stored in memory or (if necessary) in MMAPed local > files. Once the "main" Lucene index has been updated, the local index will be > purged. > Queries will use a union of results from the {{lucene}} and > {{lucene-memory}} indexes. > The {{lucene-memory}} index, as a locally stored entity, will be updated using > an observer, so it'll get both local and remote changes. > The original idea has been suggested by [~chetanm] in the discussion for > OAK-4233.
[jira] [Updated] (OAK-4412) Lucene-memory property index
[ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomek Rękawek updated OAK-4412: --- Attachment: OAK-4412.patch > Lucene-memory property index > > > Key: OAK-4412 > URL: https://issues.apache.org/jira/browse/OAK-4412 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: lucene >Reporter: Tomek Rękawek >Assignee: Tomek Rękawek > Fix For: 1.6 > > Attachments: OAK-4412.patch, OAK-4412.patch > > > When running Oak in a cluster, each write operation is expensive. After > performing some stress-tests with a geo-distributed Mongo cluster, we've > found out that updating property indexes is a large part of the overall > traffic. > The asynchronous index would be an answer here (as the index update won't be > made in the client request thread), but AEM requires the updates to be > visible immediately in order to work properly. > The idea here is to enhance the existing asynchronous Lucene index with a > synchronous, locally-stored counterpart that will persist only the data since > the last Lucene background reindexing job. > The new index can be stored in memory or (if necessary) in MMAPed local > files. Once the "main" Lucene index has been updated, the local index will be > purged. > Queries will use a union of results from the {{lucene}} and > {{lucene-memory}} indexes. > The {{lucene-memory}} index, as a locally stored entity, will be updated using > an observer, so it'll get both local and remote changes. > The original idea has been suggested by [~chetanm] in the discussion for > OAK-4233.
[jira] [Commented] (OAK-4412) Lucene-memory property index
[ https://issues.apache.org/jira/browse/OAK-4412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316394#comment-15316394 ] Vikas Saurabh commented on OAK-4412: Just to clarify - which problem are we trying to solve: # sometimes, due to delays in async indexing and background reads, subsequent requests from the same user get missing (not yet indexed / not yet background-read) results? # some code patterns which currently _expect_ the synchronous nature of property indexes (do change -> save -> query -> expect the earlier save to show up) won't cope well with the async nature? I understand the former problem statement has value and is worth solving. But my reading of this issue felt like we are trying to address the latter (the code expectation requires sync nature BUT property indices are currently expensive). I think, due to the different expectations (wrt timings - the diff between 2 user requests v/s the diff between 2 actions in the same call stack), we might want to discuss the 2 problems separately. It'd be great if the same solution solves both cases - but I think we shouldn't force the same solution on both. > Lucene-memory property index > > > Key: OAK-4412 > URL: https://issues.apache.org/jira/browse/OAK-4412 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: lucene >Reporter: Tomek Rękawek >Assignee: Tomek Rękawek > Fix For: 1.6 > > Attachments: OAK-4412.patch > > > When running Oak in a cluster, each write operation is expensive. After > performing some stress-tests with a geo-distributed Mongo cluster, we've > found out that updating property indexes is a large part of the overall > traffic. > The asynchronous index would be an answer here (as the index update won't be > made in the client request thread), but AEM requires the updates to be > visible immediately in order to work properly. 
> The idea here is to enhance the existing asynchronous Lucene index with a > synchronous, locally-stored counterpart that will persist only the data since > the last Lucene background reindexing job. > The new index can be stored in memory or (if necessary) in MMAPed local > files. Once the "main" Lucene index has been updated, the local index will be > purged. > Queries will use a union of results from the {{lucene}} and > {{lucene-memory}} indexes. > The {{lucene-memory}} index, as a locally stored entity, will be updated using > an observer, so it'll get both local and remote changes. > The original idea has been suggested by [~chetanm] in the discussion for > OAK-4233.
[jira] [Commented] (OAK-4430) DataStoreBlobStore#getAllChunkIds fetches DataRecord when not needed
[ https://issues.apache.org/jira/browse/OAK-4430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316324#comment-15316324 ] Amit Jain commented on OAK-4430: The method {{DataStoreBlobStore#getAllChunkIds}} also uses the fetched DataRecord to encode the length in the id. Considering that this method has only one consumer, i.e. the {{MarkSweepGarbageCollector}}, we could alter the method itself to not encode the blob ids with the length, and state this clearly in the javadocs. Alternately, we could add an overloaded method that returns all raw blob ids. Either way, this would require a method which the GC class can use to get a raw id from a length-encoded id, since the "node store referenced blobs" collection phase returns length-encoded ids. [~chetanm] wdyt? > DataStoreBlobStore#getAllChunkIds fetches DataRecord when not needed > > > Key: OAK-4430 > URL: https://issues.apache.org/jira/browse/OAK-4430 > Project: Jackrabbit Oak > Issue Type: Bug > Components: blob >Reporter: Amit Jain >Assignee: Amit Jain > Labels: candidate_oak_1_0, candidate_oak_1_2, candidate_oak_1_4 > Fix For: 1.5.3 > > > DataStoreBlobStore#getAllChunkIds loads the DataRecord to check that the > lastModifiedTime criterion is satisfied against the given > {{maxLastModifiedTime}}. > When {{maxLastModifiedTime}} has the value 0, it effectively means: ignore > any last modified time check (which is currently the only usage, from > MarkSweepGarbageCollector). In this case fetching the DataRecords should be skipped, as > it can be very expensive, e.g. on calls to S3 with millions of blobs.
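The raw-id helper discussed here could be as simple as the following sketch, assuming an "id#length" suffix encoding; the class and method names are hypothetical, not an actual Oak API:

```java
// Sketch: given a length-encoded id as returned by the "node store
// referenced blobs" collection phase, recover the raw data store id by
// stripping the assumed "#<length>" suffix.
public class BlobIdUtils {

    public static String stripLength(String encodedId) {
        int idx = encodedId.lastIndexOf('#');
        // ids without a length suffix are returned unchanged
        return idx == -1 ? encodedId : encodedId.substring(0, idx);
    }
}
```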
[jira] [Commented] (OAK-4422) support cluster for FileBlobStore
[ https://issues.apache.org/jira/browse/OAK-4422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15316310#comment-15316310 ] Marco Piovesana commented on OAK-4422: -- Yes, we use TAR. We have a WebSphere cluster with a file system shared between all the nodes. In this file system there is the Oak node store. This is how we instantiate the repository: {code:title=RepositoryCreation|borderStyle=solid} BlobStore blobStore = new FileBlobStore(dataStoreFile.getAbsolutePath()); FileStore repositoryStore = FileStore.newFileStore(repositoryFile).withBlobStore(blobStore).create(); NodeStore nodeStore = SegmentNodeStore.newSegmentNodeStore(repositoryStore).create(); Jcr jcr = new Jcr(nodeStore).with(new InitialContent()).with(new SecurityProviderImpl()); Repository repository = jcr.createRepository(); {code} > support cluster for FileBlobStore > - > > Key: OAK-4422 > URL: https://issues.apache.org/jira/browse/OAK-4422 > Project: Jackrabbit Oak > Issue Type: Improvement > Components: blob >Affects Versions: 1.4.3 >Reporter: Marco Piovesana > > I'm using Oak in a system where the user can store arbitrarily large binary > files, and because of that I thought the best option was to use the > FileBlobStore as the storage subsystem. > Now I need to port this solution to a cluster environment, but I saw that > clustering is supported only for Mongo and RDBMS storage systems. Is there > any plan to support it also for blob storage? Is there a better option?
[jira] [Updated] (OAK-4432) Ignore files in the root directory of the FileDataStore in #getAllIdentifiers
[ https://issues.apache.org/jira/browse/OAK-4432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Amit Jain updated OAK-4432: --- Summary: Ignore files in the root directory of the FileDataStore in #getAllIdentifiers (was: Ignore files in the root directory of the DataStore) > Ignore files in the root directory of the FileDataStore in #getAllIdentifiers > - > > Key: OAK-4432 > URL: https://issues.apache.org/jira/browse/OAK-4432 > Project: Jackrabbit Oak > Issue Type: Bug > Components: blob >Reporter: Amit Jain >Assignee: Amit Jain >Priority: Minor > Labels: candidate_oak_1_2, candidate_oak_1_4 > Fix For: 1.5.3 > > > The call to OakFileDataStore#getAllIdentifiers should ignore the files > directly at the root of the DataStore (these files are used for > SharedDataStore etc.). This does not cause any functional problems, but leads > to warnings being logged. > There is already a check, but it fails when the data store root is specified > as a relative path.
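The relative-path-safe check could be sketched like this (hypothetical helper, not the actual OakFileDataStore code); normalizing both sides to absolute paths keeps the comparison correct when the root is configured as a relative path:

```java
import java.nio.file.Path;

// Sketch: decide whether a file sits directly in the data store root and
// should therefore be skipped by #getAllIdentifiers.
public class RootFileFilter {

    public static boolean isInRoot(Path file, Path dataStoreRoot) {
        // normalize so "./repository/datastore" and an absolute path to the
        // same directory compare equal
        Path parent = file.toAbsolutePath().normalize().getParent();
        return parent != null
                && parent.equals(dataStoreRoot.toAbsolutePath().normalize());
    }
}
```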
[jira] [Created] (OAK-4432) Ignore files in the root directory of the DataStore
Amit Jain created OAK-4432: -- Summary: Ignore files in the root directory of the DataStore Key: OAK-4432 URL: https://issues.apache.org/jira/browse/OAK-4432 Project: Jackrabbit Oak Issue Type: Bug Components: blob Reporter: Amit Jain Assignee: Amit Jain Priority: Minor Fix For: 1.5.3 The call to OakFileDataStore#getAllIdentifiers should ignore the files directly at the root of the DataStore (these files are used for SharedDataStore etc.). This does not cause any functional problems, but leads to warnings being logged. There is already a check, but it fails when the data store root is specified as a relative path.