Chetan Mehrotra created OAK-7105:
------------------------------------

             Summary: Implement a traverse with sort strategy for 
DocumentStoreIndexer
                 Key: OAK-7105
                 URL: https://issues.apache.org/jira/browse/OAK-7105
             Project: Jackrabbit Oak
          Issue Type: Improvement
          Components: run
            Reporter: Chetan Mehrotra
            Assignee: Chetan Mehrotra
             Fix For: 1.8, 1.7.15


Currently the DocumentStoreIndexer logic uses a StoreAndSortStrategy, in which 
it first dumps all nodestates to a JSON file, then sorts them in batches, and 
finally merges the sorted files. The sorting phase takes a significant share of 
the overall indexing time (40 mins out of a 3 hr run).
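
For illustration, the final "merge the sorted files" step can be sketched as a 
k-way merge over line-oriented files. This is only a minimal sketch and not the 
ExternalSort code used by the indexer: the class name SortedFileMergeSketch, 
the one-entry-per-line file layout and the plain String ordering are all 
assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Oak: merges already-sorted text files.
public class SortedFileMergeSketch {

    // Pairs the current head line of a file with the reader it came from.
    private static final class Head implements Comparable<Head> {
        final String line;
        final BufferedReader reader;

        Head(String line, BufferedReader reader) {
            this.line = line;
            this.reader = reader;
        }

        @Override
        public int compareTo(Head other) {
            // Assumption: natural String order stands in for the real entry order.
            return line.compareTo(other.line);
        }
    }

    public static void merge(List<Path> sortedFiles, Path target) throws IOException {
        PriorityQueue<Head> heap = new PriorityQueue<>();
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            // Seed the heap with the first line of every sorted file.
            for (Path file : sortedFiles) {
                BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
                readers.add(reader);
                String first = reader.readLine();
                if (first != null) {
                    heap.add(new Head(first, reader));
                }
            }
            // Repeatedly emit the smallest head and refill from the same file.
            while (!heap.isEmpty()) {
                Head head = heap.poll();
                out.write(head.line);
                out.newLine();
                String next = head.reader.readLine();
                if (next != null) {
                    heap.add(new Head(next, head.reader));
                }
            }
        } finally {
            for (BufferedReader reader : readers) {
                reader.close();
            }
        }
    }
}
```

The merge itself keeps only one line per input file in memory, so the memory 
pressure described in this issue comes from building the sorted batches, not 
from merging them.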

Further, this approach can hit an OOM when ExternalSort creates in-memory 
batches whose actual size considerably exceeds the estimated size. So we need 
to constantly tweak "oak.indexer.maxSortMemoryInGB" (currently set to 2 GB).

As an improvement we can make the following changes:

# Implement a traverse with sort strategy - Here, instead of first dumping all 
nodestates into a single big JSON file, we add them to an in-memory buffer and 
then at some stage sort the batch and save it to a file
# Use better memory checks - Use the approach implemented in GCBarrier, i.e. 
monitor the current memory usage and, if available memory goes below a certain 
threshold, trigger the batch sort
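
The two changes above can be sketched together as follows. This is a rough 
illustration under stated assumptions, not the proposed implementation: the 
class TraverseWithSortSketch, its constructor parameters, the one-entry-per-line 
String format and the Runtime-based memory check are all hypothetical, and the 
actual GCBarrier logic may work differently.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: buffer entries during traversal, sort and spill in batches.
public class TraverseWithSortSketch {
    private final Path workDir;
    private final int maxBufferedEntries;   // safety cap on buffered entries (assumption)
    private final long minAvailableBytes;   // spill once available heap drops below this (assumption)
    private final List<String> buffer = new ArrayList<>();
    private final List<Path> sortedFiles = new ArrayList<>();

    public TraverseWithSortSketch(Path workDir, int maxBufferedEntries, long minAvailableBytes) {
        this.workDir = workDir;
        this.maxBufferedEntries = maxBufferedEntries;
        this.minAvailableBytes = minAvailableBytes;
    }

    // Called once per traversed node; the entry is assumed to be "path|json".
    public void add(String entry) throws IOException {
        buffer.add(entry);
        if (buffer.size() >= maxBufferedEntries || isMemoryLow()) {
            flush();
        }
    }

    // Checks observed heap usage instead of relying on an estimated batch size.
    boolean isMemoryLow() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        long available = rt.maxMemory() - used;
        return available < minAvailableBytes;
    }

    // Sorts the current batch and writes it to its own file.
    public void flush() throws IOException {
        if (buffer.isEmpty()) {
            return;
        }
        // Assumption: natural String order stands in for the real entry order.
        Collections.sort(buffer);
        Path file = Files.createTempFile(workDir, "sorted-batch-", ".txt");
        Files.write(file, buffer, StandardCharsets.UTF_8);
        sortedFiles.add(file);
        buffer.clear();
    }

    // Spills any remaining entries and returns the sorted batch files for merging.
    public List<Path> close() throws IOException {
        flush();
        return sortedFiles;
    }
}
```

With this shape the separate "dump everything, then sort" pass disappears: each 
batch is already sorted when it reaches disk, and only the final merge of the 
batch files remains.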



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
