himanshug opened a new issue #8031: remove unnecessary synchronization overhead 
from complex Aggregators
URL: https://github.com/apache/incubator-druid/issues/8031
 
 
   ### Motivation
   
   Many complex [Buffer]Aggregator implementations need to add synchronized 
access to internal data structures due to single-writer-multiple-reader 
concurrent usage of those during realtime indexing process where they are 
concurrently queried in addition to getting updated. However, that 
synchronization is totally unnecessary everywhere else but we pay its price 
anyway , for example at historical nodes while querying and in batch indexing 
tasks etc. Most recently this came up in 
https://github.com/apache/incubator-datasketches-java/issues/263 .
   
   ### Proposed changes
   I haven't really done a prototype yet but I "think" these changes should be 
doable.
   
   Add following methods (with default implementations) to `AggregatorFactory` .
   ```
     public Aggregator factorize(ColumnSelectorFactory metricFactory, boolean 
isConcurrent)
     {
       return factorize(metricFactory);
     }
   
     public BufferAggregator factorizeBuffered(ColumnSelectorFactory 
metricFactory, boolean isConcurrent)
     {
       return factorizeBuffered(metricFactory);
     }
   ```
   And, replace all calls inside druid code from 
`AggregatorFactory.factorize[Buffered](ColumnSelectorFactory)` to 
`AggregatorFactory.factorize[Buffered](ColumnSelectorFactory, boolean 
isConcurrent)` with right value for `boolean isConcurrent` specified .
   `IncrementalIndex` would be made aware of its concurrency context (by 
changing existing variable `concurrentEventAdd`  to `isConcurrent` and it being 
correctly specified in all places an `IncrementalIndex` instance is created ) 
so that it can set right value for `isConcurrent` when calling 
`factorize[Buffered](..)`
   Relevant complex aggregator such as `thetaSketch` can then override newly 
added methods to add synchronization only for cases where it is really needed.
   
   ### Rationale
   
   One other option would be that aggregator implementors get additional 
contextual information (e.g. the `nodeType` they are running on ) and based on 
that enable/disable synchronization. However, proposed approach is simpler to 
use for extension writers and takes away the guessing game.
   
   ### Operational impact
   None
   
   ### Test plan (optional)
   Existing unit/integration tests would cover the changes introduced.
   
   ### Future work (optional)
   Adjust relevant complex aggregator implementations to take advantage of 
newly added methods.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to