Thanks Chen for your notes on the task view. Ultimately, the things you mention (LIMIT & filters) are exactly what I am trying to get pushed down into the SegmentsTable.

I think the changes you mention (LIMIT & filters) will help to reduce the serialization costs, and they will be even more beneficial when combined with the profiling-driven changes described below.
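As a rough sketch of what that push-down enables on the client side, the console (or any client) could page sys.tasks with LIMIT/OFFSET through the Druid SQL HTTP endpoint instead of fetching every row and paginating locally. The URL, column list, and page size below are placeholders, not a proposal for the actual console code:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SysTasksPageFetch
{
  public static void main(String[] args) throws Exception
  {
    // Placeholder endpoint; point this at your router or broker.
    String endpoint = "http://localhost:8888/druid/v2/sql";

    // One page of task rows, ordered and limited on the server side.
    String sql = "SELECT task_id, status, created_time "
               + "FROM sys.tasks "
               + "ORDER BY created_time DESC "
               + "LIMIT 25 OFFSET 50";

    // The SQL API accepts a JSON object carrying the query text.
    String body = "{\"query\": \"" + sql + "\"}";

    HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
                                     .header("Content-Type", "application/json")
                                     .POST(HttpRequest.BodyPublishers.ofString(body))
                                     .build();

    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());

    System.out.println(response.body());
  }
}

With the limit applied in the SQL layer, each page should only cost as many rows as it returns rather than the full task list.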
From profiling on our service I discovered that the bulk of the time is spent in HeapMemory::getCompletedTaskInfoByCreatedTimeDuration. There are two observations: (1) the work occurs underneath a lock, so it is constrained to at most one CPU, and (2) compounding this, the sortedCopy materializes and sorts the entire list before the filter or limit is applied, which is far more expensive than filtering before the sort/limit. Addressing these two points yields a significant CPU reduction and lower latencies, and eliminates contention when multiple users call the /tasks or /taskStatus APIs.
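To make observation (2) concrete, here is a minimal sketch of the change in shape; the class and field names are invented for illustration and are not the actual Druid code:

import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class CompletedTaskLookup
{
  /** Hypothetical stand-in for a completed-task entry; not the real Druid type. */
  public static class TaskRecord
  {
    final String id;
    final Instant createdTime;

    TaskRecord(String id, Instant createdTime)
    {
      this.id = id;
      this.createdTime = createdTime;
    }
  }

  private final List<TaskRecord> completed = new ArrayList<>();
  private final Object lock = new Object();

  /**
   * Shape of the behaviour described above: sort a full copy of the list while
   * holding the lock, then filter and limit. Cost grows with the total number of
   * completed tasks, regardless of how few rows the caller actually wants.
   */
  public List<TaskRecord> sortThenFilter(Instant since, int limit)
  {
    synchronized (lock) {
      return completed.stream()
                      .sorted(Comparator.comparing((TaskRecord t) -> t.createdTime).reversed())
                      .filter(t -> t.createdTime.isAfter(since))
                      .limit(limit)
                      .collect(Collectors.toList());
    }
  }

  /**
   * Proposed shape: filter inside the lock (no fully sorted copy), then sort and
   * limit only the matching rows outside the lock.
   */
  public List<TaskRecord> filterThenSort(Instant since, int limit)
  {
    final List<TaskRecord> matches;
    synchronized (lock) {
      matches = completed.stream()
                         .filter(t -> t.createdTime.isAfter(since))
                         .collect(Collectors.toList());
    }
    matches.sort(Comparator.comparing((TaskRecord t) -> t.createdTime).reversed());
    return matches.size() > limit ? matches.subList(0, limit) : matches;
  }
}

The second form holds the lock only for the filter pass, and sorts and truncates only the rows that survive the filter.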
I'll add these details to the tickets you mention. I have a couple of PRs that we are reviewing locally on our side, and I'll submit them upstream (hopefully soon!).

Thanks
Jason

On Tue, May 18, 2021 at 6:13 AM Chen Frank <frank.chen...@outlook.com> wrote:
>
> Hi Jason
>
> I have tracked this problem for quite a while. Since you are interested in it, I would like to share what I know so that you can take it into consideration.
>
> In 0.19.0, PR #9883 improved the performance of the segments query by eliminating the JSON serialization. But PR #10752, merged in 0.21.0, brings back JSON serialization. I do not know whether this change reverts the performance gain of the previous PR.
>
> For tasks, the performance is much worse. There are some problems reported about the task UI, e.g. #11042 and #11140, but I do not see any feedback on the segment UI.
> One reason is that the web console fetches ALL task records from the broker and does pagination at the client side instead of using a LIMIT clause in SQL to do pagination at the server side.
> Another reason is that the broker fetches ALL tasks via REST API from the overlord, which loads records from metadata storage directly and deserializes data from the `pay_load` field.
>
> For segments, the two problems above do not exist because:
>
> 1. A LIMIT clause is used in the SQL queries.
>
> 2. The segments query returns a snapshot of in-memory segment data, which means there is no query against the metadata database and no JSON deserialization of the `pay_load` field.
>
> In 0.20, OFFSET is supported for SQL queries. I think this could also be used in the queries from the web console, which would bring some performance gain.
>
> IMO, to improve the performance, we might need to make changes to:
>
> 1. the SQL layer you mentioned above
>
> 2. the SQL clauses issued by the web console
>
> 3. the task REST API, to support search conditions and ordering that narrow down the search range on the metadata table
>
> Thanks.
>
> From: Jason Koch <jk...@netflix.com.INVALID>
> Date: Saturday, May 15, 2021 3:51 AM
> To: dev@druid.apache.org <dev@druid.apache.org>
> Subject: Re: Push-down of operations for SystemSchema tables
>
> @Julian - thank you for the review & confirmation.
>
> Hi Clint
>
> Thank you, I appreciate the response. I have responded inline with some questions; I've also restated things in my own words to confirm that I understand ...
>
> > In the mid term, I think that some of us have been thinking that moving system tables into the Druid native query engine is the way to go, and have been working on resolving a number of hurdles that are required to make this happen. One of the main motivators to do this is so that we have just the Druid query path in the planner in the Calcite layer, and deprecating and eventually dropping the "bindable" path completely, described in https://github.com/apache/druid/issues/9896. System tables would be pushed into Druid Datasource implementations, and queries would be handled in the native engine. Gian has even made a prototype of what this might look like, https://github.com/apache/druid/compare/master...gianm:sql-sys-table-native since much of the ground work is now in place, though it takes a hard-line approach of completely removing bindable instead of hiding it behind a flag, and doesn't implement all of the system tables yet, at least last time I looked at it.
>
> Looking over the changes it seems that:
> - a new VirtualDataSource is introduced, which the Druid non-SQL processing engine can process, and which can wrap an Iterable. This exposes a lazy segment & iterable using InlineDataSource.
> - the SegmentsTable has been converted from a ScannableTable to a DruidTable, and a ScannableTableIterator is introduced to generate an iterable containing the rows; the new VirtualDataSource can be used to access the rows of this table.
> - finally, the Bindable convention is discarded from DruidPlanner and Rules.
>
> > I think there are a couple of remaining parts to resolve that would make this feasible. The first is that native scan queries need support for ordering by arbitrary columns, instead of just time, so that we can retain the capabilities of the existing system tables.
>
> It seems you want to use the native queries to support ordering; do you mean here the underlying SegmentsTable, or something in the Druid engine? Currently, the SegmentsTable etc. relies on, as you say, the bindable convention to provide the sort. If it were a DruidTable then it seems that sorting gets pushed into PartialDruidQuery->DruidQuery, which conceptually is able to do a sort, but as described in [1] [2] the ordering is not supported by the underlying Druid engine [3].
>
> This would mean that an order by / sort / limit query would not be supported on any of the migrated sys.* tables until Druid has a way to perform the sort on a ScanQuery.
>
> [1] https://druid.apache.org/docs/latest/querying/scan-query.html#time-ordering
> [2] https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/rel/DruidQuery.java#L1075-L1078
> [3] https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/scan/ScanQueryEngine.java
>
> > This isn't actually a blocker for adding native system table queries, but rather a blocker for replacing the bindable convention by default so that there isn't a loss (or rather trade) of functionality. Additionally, I think there are maybe some matters regarding authorization of system tables, when handled by the native engine, that will need to be resolved, but this can be done while adding the native implementations.
>
> It looks like the port of the tables from a classic ScannableTable to a DruidTable itself is straightforward. However, it seems this PR doesn't bring them across from the SQL domain to be available in native queries. I'm not sure if this is expected, an interim step, or if I have misunderstood the goal.
> > I think there are various ideas and experiments underway on how to do sorting on scan queries at normal Druid datasource scale, which is sort of a big project, but in the short term we might be able to do something less ambitious that works well enough at system-table scale to allow this plan to fully proceed.
>
> One possible way, that I think leads in the correct direction:
> 1) We have an existing rule for LogicalTable with DruidTable to DruidQueryRel, which can eventually construct a DruidQuery.
> 2) The VirtualDataSource, created during SQL parsing, takes an already-constructed Iterable; so we need to have already performed the filter/sort before creating the VirtualDataSource (and DruidQuery). This means the push-down filter logic has to happen during sql/ stage setup, before handoff to the processing/ engine.
> 3) Perhaps a new VirtualDruidTable subclassing DruidTable, with a RelOptRule that can identify LogicalXxx nodes above a VirtualDruidTable and push them down? Then our SegmentsTable and friends can expose the correct Iterable. This should allow us to solve the perf concerns and would allow us to present a correctly constructed VirtualDataSource. Sort from SQL _should_ be supported (I think), as the planner can push the sort etc. down to these nodes directly.
>
> In this approach, the majority of the work would have to happen before the Druid engine, in sql/, and so Druid core doesn't actually need to know anything about these changes.
>
> On the other hand, whilst it keeps the pathway open, I'm not sure this does any of the actual work to make the sys.* tables available as native tables. If we are to make these into truly native tables, without a native sort, and remove their implementation from sql/, the DruidQuery in the planner would need to be configured to pass the ScanQuery sort to the processing engine _but only for sys.* tables_, and then the processing engine would need to know how to find these tables. (I haven't explored this.) As you mention, implementing native sort across multiple data sources seems like a more ambitious piece of work.
>
> As another idea, we could consider creating a bridge Bindable/EnumerableToDruid rule that would allow Druid to embed these tables, move them out of sql/ into processing/, exposed as Iterable/Enumerable, and make them available in queries if that is a goal. I'm not really sure that adds anything to the overall goals, though.
>
> > Does this approach make sense? I don't believe Gian is actively working on this at the moment, so I think if you're interested in moving along this approach and want to start laying the groundwork I'm happy to provide guidance and help out.
>
> I am interested. For my current work, I do want to keep the focus on the sys.* performance work. If there's a way to do that and lay the groundwork, or even get all the work done, then I am 100% for that. Looking at what you want to do to convert these sys.* tables to native tables, if we have a viable solution or are comfortable with my suggestions above, I'd be happy to build it out.
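To make suggestion (3) in the quoted list above a little more concrete, here is a rough sketch of the shape such a rule could take. The SortableSysTable interface is a hypothetical stand-in for whatever a VirtualDruidTable would expose; none of this exists in Druid today, and it is only an illustration of the idea:

import org.apache.calcite.plan.RelOptRule;
import org.apache.calcite.plan.RelOptRuleCall;
import org.apache.calcite.rel.RelCollation;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.core.TableScan;
import org.apache.calcite.rel.logical.LogicalSort;
import org.apache.calcite.rex.RexNode;

public class SysTableSortPushDownRule extends RelOptRule
{
  /**
   * Hypothetical capability a VirtualDruidTable-style system table could expose:
   * build a scan whose backing Iterable is already sorted and limited.
   */
  public interface SortableSysTable
  {
    RelNode toSortedRel(TableScan originalScan, RelCollation collation, RexNode offset, RexNode fetch);
  }

  public static final SysTableSortPushDownRule INSTANCE = new SysTableSortPushDownRule();

  private SysTableSortPushDownRule()
  {
    // Match a LogicalSort sitting directly on top of a table scan.
    super(operand(LogicalSort.class, operand(TableScan.class, none())), "SysTableSortPushDownRule");
  }

  @Override
  public boolean matches(RelOptRuleCall call)
  {
    // Only fire when the scanned table can absorb the sort/limit itself.
    TableScan scan = call.rel(1);
    return scan.getTable().unwrap(SortableSysTable.class) != null;
  }

  @Override
  public void onMatch(RelOptRuleCall call)
  {
    LogicalSort sort = call.rel(0);
    TableScan scan = call.rel(1);
    SortableSysTable table = scan.getTable().unwrap(SortableSysTable.class);

    // Hand the collation and offset/fetch to the table so the Iterable it produces
    // is already ordered and limited, then replace the Sort node in the plan.
    call.transformTo(table.toSortedRel(scan, sort.getCollation(), sort.offset, sort.fetch));
  }
}

The planner would register the rule, and the matched table would produce an Iterable that is already filtered, sorted, and limited before the VirtualDataSource is constructed.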
>
> Thanks
> Jason

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
For additional commands, e-mail: dev-h...@druid.apache.org