Thanks Chen for your notes on the task view. Ultimately, the things you mention (LIMIT & filters) are exactly what I am trying to get pushed down into the SegmentsTable.

I think the changes you mention (LIMIT & filters) will help to reduce the serialization costs, and they will be even more beneficial when combined with the profiling-driven changes described below.
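As a rough sketch of what that push-down enables on the client side, the console (or any client) could page sys.tasks with LIMIT/OFFSET through the Druid SQL HTTP endpoint instead of fetching every row and paginating locally. The URL, column list, and page size below are placeholders, not a proposal for the actual console code:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SysTasksPageFetch
{
  public static void main(String[] args) throws Exception
  {
    // Placeholder endpoint; point this at your router or broker.
    String endpoint = "http://localhost:8888/druid/v2/sql";

    // One page of task rows, ordered and limited on the server side.
    String sql = "SELECT task_id, status, created_time "
               + "FROM sys.tasks "
               + "ORDER BY created_time DESC "
               + "LIMIT 25 OFFSET 50";

    // The SQL API accepts a JSON object carrying the query text.
    String body = "{\"query\": \"" + sql + "\"}";

    HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                                     .header("Content-Type", "application/json")
                                     .POST(HttpRequest.BodyPublishers.ofString(body))
                                     .build();

    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());

    System.out.println(response.body());
  }
}

With the limit applied in the SQL layer, each page should only cost as many rows as it returns rather than the full task list.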
From profiling on our service I discovered that the bulk of the time is spent in HeapMemory::getCompletedTaskInfoByCreatedTimeDuration. There are two observations: (1) the work occurs underneath a lock, so it is constrained to at most one CPU, and (2) compounding this, the sortedCopy materializes and sorts the entire list before the filter or limit is applied, which is far more expensive than filtering before the sort/limit. Addressing these two points yields a significant CPU reduction and lower latencies, and eliminates contention when multiple users call the /tasks or /taskStatus APIs.
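To make observation (2) concrete, here is a minimal sketch of the change in shape; the class and field names are invented for illustration and are not the actual Druid code:

import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class CompletedTaskLookup
{
  /** Hypothetical stand-in for a completed-task entry; not the real Druid type. */
  public static class TaskRecord
  {
    final String id;
    final Instant createdTime;

    TaskRecord(String id, Instant createdTime)
    {
      this.id = id;
      this.createdTime = createdTime;
    }
  }

  private final List<TaskRecord> completed = new ArrayList<>();
  private final Object lock = new Object();

  /**
   * Shape of the behaviour described above: sort a full copy of the list while
   * holding the lock, then filter and limit. Cost grows with the total number of
   * completed tasks, regardless of how few rows the caller actually wants.
   */
  public List<TaskRecord> sortThenFilter(Instant since, int limit)
  {
    synchronized (lock) {
      return completed.stream()
                      .sorted(Comparator.comparing((TaskRecord t) -> t.createdTime).reversed())
                      .filter(t -> t.createdTime.isAfter(since))
                      .limit(limit)
                      .collect(Collectors.toList());
    }
  }

  /**
   * Proposed shape: filter inside the lock (no fully sorted copy), then sort and
   * limit only the matching rows outside the lock.
   */
  public List<TaskRecord> filterThenSort(Instant since, int limit)
  {
    final List<TaskRecord> matches;
    synchronized (lock) {
      matches = completed.stream()
                         .filter(t -> t.createdTime.isAfter(since))
                         .collect(Collectors.toList());
    }
    matches.sort(Comparator.comparing((TaskRecord t) -> t.createdTime).reversed());
    return matches.size() > limit ? matches.subList(0, limit) : matches;
  }
}

The second form holds the lock only for the filter pass, and sorts and truncates only the rows that survive the filter.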
I'll add these details to the tickets you mention. I have a couple of PRs that we are reviewing locally on our side, and I'll submit them upstream (hopefully soon!).

Thanks
Jason

On Tue, May 18, 2021 at 6:13 AM Chen Frank <frank.chen...@outlook.com> wrote:
>
> Hi Jason
>
> I have tracked this problem for quite a while. Since you are interested in it, I would like to share what I know so that you can take it into consideration.
>
> In 0.19.0, PR #9883 improved the performance of the segments query by eliminating the JSON serialization. But PR #10752, merged in 0.21.0, brings back JSON serialization. I do not know whether this change reverts the performance gain of the previous PR.
>
> For tasks, the performance is much worse. There are some problems reported about the task UI, e.g. #11042 and #11140, but I do not see any feedback on the segment UI.
> One reason is that the web console fetches ALL task records from the broker and does pagination at the client side instead of using a LIMIT clause in SQL to do pagination at the server side.
> Another reason is that the broker fetches ALL tasks via REST API from the overlord, which loads records from metadata storage directly and deserializes data from the `pay_load` field.
>
> For segments, the two problems above do not exist because:
>
> 1. A LIMIT clause is used in the SQL queries.
>
> 2. The segments query returns a snapshot of in-memory segment data, which means there is no query against the metadata database and no JSON deserialization of the `pay_load` field.
>
> In 0.20, OFFSET is supported for SQL queries. I think this could also be used in the queries from the web console, which would bring some performance gain.
>
> IMO, to improve the performance, we might need to make changes to:
>
> 1. the SQL layer you mentioned above
>
> 2. the SQL clauses issued by the web console
>
> 3. the task REST API, to support search conditions and ordering that narrow down the search range on the metadata table
>
> Thanks.
>
> From: Jason Koch <jk...@netflix.com.INVALID>
> Date: Saturday, May 15, 2021 3:51 AM
> To: dev@druid.apache.org <dev@druid.apache.org>
> Subject: Re: Push-down of operations for SystemSchema tables
>
> @Julian - thank you for the review & confirmation.
>
> Hi Clint
>
> Thank you, I appreciate the response. I have responded inline with some questions; I've also restated things in my own words to confirm that I understand ...
>
> > In the mid term, I think that some of us have been thinking that moving system tables into the Druid native query engine is the way to go, and have been working on resolving a number of hurdles that are required to make this happen. One of the main motivators to do this is so that we have just the Druid query path in the planner in the Calcite layer, and deprecating and eventually dropping the "bindable" path completely, described in https://github.com/apache/druid/issues/9896. System tables would be pushed into Druid Datasource implementations, and queries would be handled in the native engine. Gian has even made a prototype of what this might look like, https://github.com/apache/druid/compare/master...gianm:sql-sys-table-native since much of the ground work is now in place, though it takes a hard-line approach of completely removing bindable instead of hiding it behind a flag, and doesn't implement all of the system tables yet, at least last time I looked at it.
>
> Looking over the changes it seems that:
> - a new VirtualDataSource is introduced, which the Druid non-SQL processing engine can process, and which can wrap an Iterable. This exposes a lazy segment & iterable using InlineDataSource.
> - the SegmentsTable has been converted from a ScannableTable to a DruidTable, and a ScannableTableIterator is introduced to generate an iterable containing the rows; the new VirtualDataSource can be used to access the rows of this table.
> - finally, the Bindable convention is discarded from DruidPlanner and Rules.
>
> > I think there are a couple of remaining parts to resolve that would make this feasible. The first is that native scan queries need support for ordering by arbitrary columns, instead of just time, so that we can retain the capabilities of the existing system tables.
>
> It seems you want to use the native queries to support ordering; do you mean here the underlying SegmentsTable, or something in the Druid engine? Currently, the SegmentsTable etc. relies on, as you say, the bindable convention to provide the sort. If it were a DruidTable then it seems that sorting gets pushed into PartialDruidQuery->DruidQuery, which conceptually is able to do a sort, but as described in [1] [2] the ordering is not supported by the underlying Druid engine [3].
>
> This would mean that an order by / sort / limit query would not be supported on any of the migrated sys.* tables until Druid has a way to perform the sort on a ScanQuery.
>
> [1] https://druid.apache.org/docs/latest/querying/scan-query.html#time-ordering
> [2] https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/rel/DruidQuery.java#L1075-L1078
> [3] https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/scan/ScanQueryEngine.java
>
> > This isn't actually a blocker for adding native system table queries, but rather a blocker for replacing the bindable convention by default so that there isn't a loss (or rather trade) of functionality. Additionally, I think there are maybe some matters regarding authorization of system tables, when handled by the native engine, that will need to be resolved, but this can be done while adding the native implementations.
>
> It looks like the port of the tables from a classic ScannableTable to a DruidTable itself is straightforward. However, it seems this PR doesn't bring them across from the SQL domain to be available in native queries. I'm not sure if this is expected, an interim step, or if I have misunderstood the goal.
> > I think there are various ideas and experiments underway on how to do sorting on scan queries at normal Druid datasource scale, which is sort of a big project, but in the short term we might be able to do something less ambitious that works well enough at system-table scale to allow this plan to fully proceed.
>
> One possible way, that I think leads in the correct direction:
> 1) We have an existing rule for LogicalTable with DruidTable to DruidQueryRel, which can eventually construct a DruidQuery.
> 2) The VirtualDataSource, created during SQL parsing, takes an already-constructed Iterable; so we need to have already performed the filter/sort before creating the VirtualDataSource (and DruidQuery). This means the push-down filter logic has to happen during sql/ stage setup, before handoff to the processing/ engine.
> 3) Perhaps a new VirtualDruidTable subclassing DruidTable, with a RelOptRule that can identify LogicalXxx nodes above a VirtualDruidTable and push them down? Then our SegmentsTable and friends can expose the correct Iterable. This should allow us to solve the perf concerns and would allow us to present a correctly constructed VirtualDataSource. Sort from SQL _should_ be supported (I think), as the planner can push the sort etc. down to these nodes directly.
>
> In this approach, the majority of the work would have to happen before the Druid engine, in sql/, and so Druid core doesn't actually need to know anything about these changes.
>
> On the other hand, whilst it keeps the pathway open, I'm not sure this does any of the actual work to make the sys.* tables available as native tables. If we are to make these into truly native tables, without a native sort, and remove their implementation from sql/, the DruidQuery in the planner would need to be configured to pass the ScanQuery sort to the processing engine _but only for sys.* tables_, and then the processing engine would need to know how to find these tables. (I haven't explored this.) As you mention, implementing native sort across multiple data sources seems like a more ambitious piece of work.
>
> As another idea, we could consider creating a bridge Bindable/EnumerableToDruid rule that would allow Druid to embed these tables, move them out of sql/ into processing/, exposed as Iterable/Enumerable, and make them available in queries if that is a goal. I'm not really sure that adds anything to the overall goals, though.
>
> > Does this approach make sense? I don't believe Gian is actively working on this at the moment, so I think if you're interested in moving along this approach and want to start laying the groundwork I'm happy to provide guidance and help out.
>
> I am interested. For my current work, I do want to keep the focus on the sys.* performance work. If there's a way to do that and lay the groundwork, or even get all the work done, then I am 100% for that. Looking at what you want to do to convert these sys.* tables to native tables, if we have a viable solution or are comfortable with my suggestions above, I'd be happy to build it out.
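To make suggestion (3) in the quoted list above a little more concrete, here is a rough sketch of the shape such a rule could take. The SortableSysTable interface is a hypothetical stand-in for whatever a VirtualDruidTable would expose; none of this exists in Druid today, and it is only an illustration of the idea:

import org.apache.calcite.plan.RelOptRule;
import org.apache.calcite.plan.RelOptRuleCall;
import org.apache.calcite.rel.RelCollation;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.core.TableScan;
import org.apache.calcite.rel.logical.LogicalSort;
import org.apache.calcite.rex.RexNode;

public class SysTableSortPushDownRule extends RelOptRule
{
  /**
   * Hypothetical capability a VirtualDruidTable-style system table could expose:
   * build a scan whose backing Iterable is already sorted and limited.
   */
  public interface SortableSysTable
  {
    RelNode toSortedRel(TableScan originalScan, RelCollation collation, RexNode offset, RexNode fetch);
  }

  public static final SysTableSortPushDownRule INSTANCE = new SysTableSortPushDownRule();

  private SysTableSortPushDownRule()
  {
    // Match a LogicalSort sitting directly on top of a table scan.
    super(operand(LogicalSort.class, operand(TableScan.class, none())), "SysTableSortPushDownRule");
  }

  @Override
  public boolean matches(RelOptRuleCall call)
  {
    // Only fire when the scanned table can absorb the sort/limit itself.
    TableScan scan = call.rel(1);
    return scan.getTable().unwrap(SortableSysTable.class) != null;
  }

  @Override
  public void onMatch(RelOptRuleCall call)
  {
    LogicalSort sort = call.rel(0);
    TableScan scan = call.rel(1);
    SortableSysTable table = scan.getTable().unwrap(SortableSysTable.class);

    // Hand the collation and offset/fetch to the table so the Iterable it produces
    // is already ordered and limited, then replace the Sort node in the plan.
    call.transformTo(table.toSortedRel(scan, sort.getCollation(), sort.offset, sort.fetch));
  }
}

The planner would register the rule, and the matched table would produce an Iterable that is already filtered, sorted, and limited before the VirtualDataSource is constructed.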
>
> Thanks
> Jason

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
For additional commands, e-mail: dev-h...@druid.apache.org