I think part of my confusion stems from the gap in the level of abstraction. Druid terminology focuses on aggregation, but in sketches there are two levels of aggregation:
A sketch is a first-level aggregation, which holds the gist of the stream.
A union is a second-level aggregation, which can aggregate sketches.

Adding 2 questions to the questions below:

The locks are used to synchronize access to the union. I assume the union is a second-level aggregation merging sketches that are built during ingestion.
3) If this is not the case, then why does Druid apply a union rather than simply using a sketch to aggregate the data?
4) If it is the case, then is it guaranteed that the merged sketches are immutable? Otherwise, wrapping the union with locks is not enough.
I hope my questions make more sense now.

Thanks,
Eshcar

On Sunday, July 22, 2018, 4:15:02 PM GMT+3, Eshcar Hillel <esh...@oath.com> wrote:

Thanks Gian - I was missing the part about aggregation during ingestion-time roll-up. I looked at the SketchAggregator code and read the Druid overview document; let me verify that I got this right.

Consider the best-effort roll-up mode: as events arrive they are ingested into multiple segments, call these s0...s9, although they should belong to a single segment. Then a roll-up process ru aggregates s0...s9 one-by-one (?) into a single segment. During the roll-up, ru can be queried and therefore needs to be thread safe.

1) Who is the "owner" of the roll-up process? What triggers the roll-up thread? Is it considered part of ingestion/indexing time, or is it done in the background as a kind of optimization?
2) The document says "Data is queryable as soon as it is ingested by the realtime processing logic." Does this mean that queries can apply get to s0...s9? Should they be thread safe as well?

On Thursday, July 19, 2018, 10:16:34 PM GMT+3, Gian Merlino <g...@apache.org> wrote:

Hi Eshcar,

I don't think I 100% understand what you are asking, but I will say some things, and hopefully they will be helpful.

In Druid we use aggregators for two things: aggregation during ingestion (for ingestion-time rollup) and aggregation during queries. During queries the aggregators are only ever used by one thread at a time. At ingestion time, "aggregate" and "get" can be called simultaneously. It happens because "aggregate" is called from an ingestion thread (because we update running aggregators during ingestion), and "get" is called by query threads (because they "get" those aggregator values from the ingestion aggregator object to feed them to a query aggregator object). These calls are not synchronized by Druid, so individual aggregators need to do it themselves if necessary.
There was some effort to address this systematically: https://github.com/apache/incubator-druid/pull/3956, although it hasn't been finished yet. Check out some of the discussion on that patch for more background, and a question I just posted there: does it make more sense for ingestion-time aggregator thread-safety to be handled systematically (at the IncrementalIndex) or for each aggregator to need to be thread safe?

If you're looking at "aggregate" and "get" in this file, those are the two that could get called simultaneously: https://github.com/apache/incubator-druid/blob/master/extensions-core/datasketches/src/main/java/io/druid/query/aggregation/datasketches/theta/SketchAggregator.java

On Sun, Jul 15, 2018 at 12:11 AM Eshcar Hillel <esh...@oath.com.invalid> wrote:

> Apologies, I must be missing something very basic in how incremental
> indexing works. A sketch is by itself an aggregator - it can absorb
> millions of updates before it exceeds its space limit or is flushed to disk.
> I assumed the ingestion thread aggregates data in multiple sketches in
> parallel, then at query time a union operation is invoked to merge relevant
> sketches based on the attributes of the query, and when the union is
> completed its result is returned to the user. But in such a scenario there
> is no need to call get before the union is completed.
> This means there is another scenario where the union is used and can be
> queried while in the process of executing the merge. Is this to maintain
> some in-memory hierarchy of aggregations? Or for creating the snapshots
> that are flushed to disk?
> A better understanding of the use case will help in presenting a better
> thread-safe solution.
> Thanks,
> Eshcar
>
> On Wednesday, July 11, 2018, 7:51:24 PM GMT+3, Gian Merlino
> <g...@apache.org> wrote:
>
> Hi Eshcar,
>
> > But even in a single-writer-single-reader scenario removing the lock can
> > increase the throughput of accesses to the object.
>
> Definitely worth trying this out, imo.
>
> > However, I don't understand why is the union object read before the
> > result is ready.
>
> It's used as part of incremental indexing: the idea is that we create
> aggregates during ingestion time and we want those to be queryable even
> while ingestion is still ongoing. So the ingestion thread will be calling
> "aggregate" and a query thread will be calling "get" potentially
> simultaneously.
>
> On Wed, Jul 11, 2018 at 1:04 AM Eshcar Hillel <esh...@oath.com.invalid>
> wrote:
>
> > Thanks Gian,
> > This is also my understanding. But even in a single-writer-single-reader
> > scenario removing the lock can increase the throughput of accesses to the
> > object.
> > If the union is only used to produce the result at query time then
> > removing the lock would not affect ingestion throughput, but could
> > decrease query latency. However, I don't understand why is the union
> > object read before the result is ready.
> >
> > On Tuesday, July 10, 2018, 8:13:36 PM GMT+3, Gian Merlino
> > <g...@apache.org> wrote:
> >
> > Hi Eshcar,
> >
> > To my knowledge, in the Druid Aggregator and BufferAggregator interfaces,
> > the main place where concurrency happens is that "aggregate" and "get"
> > may be called simultaneously during realtime ingestion. So if there would
> > be a benefit from improving concurrency it would probably end up in that
> > area.
> >
> > On Tue, Jul 10, 2018 at 2:10 AM Eshcar Hillel <esh...@oath.com.invalid>
> > wrote:
> >
> > > Hi All,
> > > My name is Eshcar Hillel from Oath research.
> > > I'm currently working with Lee Rhodes on committing a new concurrent
> > > implementation of the theta sketch to the sketches-core library. I was
> > > wondering whether this implementation can help boost the union
> > > operation that is applied to multiple sketches at query time in Druid.
> > > From what I see in the code, the sketch aggregator uses the
> > > SynchronizedUnion implementation, which basically uses a lock at every
> > > single access (update/read) of the union operation. We believe a
> > > thread-safe implementation of the union operation can help decrease
> > > the inherent overhead of the lock.
> > > I will be happy to join the meeting today and briefly discuss this
> > > option.
> > > Thanks,
> > > Eshcar
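
For readers following the thread, here is a minimal sketch of the ingestion-time pattern discussed above: one ingestion thread calling "aggregate" while a query thread calls "get" on the same object, with a lock taken on every access. It is only an illustration under stated assumptions. The class names ToyDistinctAggregator and AggregateGetDemo are hypothetical, a plain HashSet stands in for the theta union, and none of this is Druid's SketchAggregator or the sketches-core API.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo driver: one ingestion thread and one query thread
// share a single aggregator, as described in the thread above.
final class AggregateGetDemo {
    static long run(int events) {
        ToyDistinctAggregator agg = new ToyDistinctAggregator();

        // Ingestion thread: updates the running aggregate for each event.
        Thread ingestion = new Thread(() -> {
            for (int i = 0; i < events; i++) {
                agg.aggregate("user-" + (i % 1000));
            }
        });

        // Query thread: reads the aggregate while ingestion is still ongoing.
        Thread query = new Thread(() -> {
            while (ingestion.isAlive()) {
                agg.get();
            }
        });

        ingestion.start();
        query.start();
        try {
            ingestion.join();
            query.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return agg.get();
    }

    public static void main(String[] args) {
        System.out.println("distinct values: " + run(100_000));
    }
}

// Hypothetical stand-in for an ingestion-time aggregator. The HashSet is an
// assumption made for illustration; a real aggregator would hold a sketch union.
final class ToyDistinctAggregator {
    private final Set<String> seen = new HashSet<>();

    // Called from the ingestion thread; takes the object's lock on every update.
    synchronized void aggregate(String value) {
        seen.add(value);
    }

    // Called from query threads; takes the same lock on every read.
    synchronized long get() {
        return seen.size();
    }
}
```

Because both methods are synchronized, every update and every read pays for the lock even though there is a single writer. That per-access cost is the overhead the original proposal hopes a concurrent sketch implementation could reduce.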