Hi Gian,

It looks like I have a solution that works, and I'd like to present it for your thoughts.
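To ground the discussion, here is the shape of the DataSource change in simplified form. The names canScanOrdered and VirtualDataSource match the branch, but the interfaces, the List<String> signature, and the TableDataSource stand-in below are illustrative only, not the actual Druid classes:

```java
// Illustrative sketch only -- canScanOrdered / VirtualDataSource are the real
// names in the branch, but these interfaces and signatures are simplified
// stand-ins, not Druid's actual classes.
import java.util.Collections;
import java.util.List;

interface DataSource
{
  // Default: a datasource supports only the existing time-ordering.
  default boolean canScanOrdered(List<String> orderByColumns)
  {
    return orderByColumns.isEmpty()
        || (orderByColumns.size() == 1 && "__time".equals(orderByColumns.get(0)));
  }
}

// A virtual datasource backed by an in-memory table can accept any ordering,
// because the table implementation performs the sort itself.
class VirtualDataSource implements DataSource
{
  @Override
  public boolean canScanOrdered(List<String> orderByColumns)
  {
    return true;
  }
}

// Stand-in for a regular datasource that keeps the default behavior.
class TableDataSource implements DataSource
{
}

public class CanScanOrderedSketch
{
  public static void main(String[] args)
  {
    DataSource regular = new TableDataSource();
    DataSource virtual = new VirtualDataSource();

    System.out.println(regular.canScanOrdered(Collections.singletonList("__time")));     // true
    System.out.println(regular.canScanOrdered(Collections.singletonList("segment_id"))); // false
    System.out.println(virtual.canScanOrdered(Collections.singletonList("segment_id"))); // true
  }
}
```

The idea is that ScanQuery construction consults this check before accepting a non-time ordering; the real version would also need to see sort direction, limit, and filters, which I've omitted here.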
https://github.com/apache/druid/compare/master...jasonk000:segment-table-interim?diff=unified

At runtime, the DruidQuery attempts to construct a ScanQuery. To do so, it checks with the underlying datasource whether a scan is supported, and if it is, it constructs a matching ScanQuery and passes it on. Later, ScanQueryEngine looks at whether the adapter is a SortedCursorFactory, and if it is, it requests the cursors and passes all relevant sort/limit/filter parameters through to the underlying table logic. The VirtualTable implementation can then perform its magic and return a result!

The solution is:
- Create a new VirtualDataSource, a new VirtualSegment and VirtualStorageAdapter, as well as a new SortedCursorFactory to support the VirtualDataSource.
- System tables have a DruidVirtualTableRule that converts a LogicalTableScan of a VirtualTable into a PartialDruidQuery whilst passing along the query authentication information.
- DataSource gets a canScanOrdered function so that it can advise the query planner whether it can handle a given sort; the default is that only the existing time-ordering is supported. Implementations (i.e. VirtualDataSource) can decide whether to accept an ordered scan.

Overall, I think this provides a viable solution and meets your goals. In rebasing this morning I see you're on the same pathway with the ordered scan query, so I could rebase on top of that and break this into a smaller set of PRs; nonetheless, the conceptual approach and direction is something that I think will work.

Thanks!
Jason

On Wed, May 19, 2021 at 9:54 PM Gian Merlino <g...@apache.org> wrote:

> Hey Jason,
>
> It sounds like we have two different, but related goals:
>
> 1) Your goal is to improve the performance of system tables.
>
> 2) My goal with the branch Clint linked is to enable using Druid's native query engine for system tables, in order to achieve consistency in how SQL queries are executed, and also so all of Druid's special functions, aggregations, extensions, etc. are available to use in system tables.
>
> Two notes about what I'm trying to do, btw, in response to things you raised. First, I'm interested in using Druid's native query engine under the hood, but I'm not necessarily driving towards being able to actually query system tables using native Druid queries. It still achieves my goals if these tables are only available in SQL queries. Second, you're correct that for this to work for queries with 'order by' but no 'group by', we'd need to add ordering support to the Scan query.
>
> That's actually the main reason I stopped working on this branch: I started looking into Scan ordering instead. Then I got distracted with other stuff, and now I'm working on neither of those things 🙂. Anyway, I think it'll be somewhat involved to implement Scan ordering in a scalable way for any possible query on any possible datasource, but if we're focused on sys tables, we can take a shortcut that is less scalable. It wouldn't be tough to make something that works for anything that works today, since the bindable convention we use today simply does the sort in memory (see org.apache.calcite.interpreter.SortNode). That might be a good way to unblock the sys-table-via-native-engine work. We'd just need some safeguard to prevent that code from executing on real datasources that are too big to materialize. Perhaps a row limit, or perhaps enabling in-memory ordering using an undocumented Scan context parameter set by the SQL layer only for sys tables.
>
> > I am interested. For my current work, I do want to keep focus on the sys.* performance work.
> > If there's a way to do it and lay the groundwork or even get all the work done, then I am 100% for that. Looking at what you want to do to convert these sys.* to native tables, if we have a viable solution or are comfortable with my suggestions above, I'd be happy to build it out.
>
> I think the plan you laid out makes sense, especially part (3). Using a VirtualDruidTable that 'represents' a system table, but hasn't yet actually built an Iterable, would allow us to defer creation of the VirtualDataSource til later in the planning process. That'll enable more kinds of pushdown via rules, like you mentioned. Coupling that with an in-memory sort for the Scan query — appropriately guarded — I think would get us all the way there. Later on, as a separate project, we'll want to extend the Scan ordering to scale beyond what can be done in memory, but that wouldn't be necessary for the system tables.
>
> Btw, I would suggest not removing the bindable stuff completely. I did that in my branch, but I regret it, since I think the change is risky enough that it should be an option that is opt-in for a while. We could rip out the old stuff after a few releases, once we're happy with the stability of the new mechanism.
>
> What do you think?
>
> On Fri, May 14, 2021 at 2:51 PM Jason Koch <jk...@netflix.com.invalid> wrote:
>
> > @Julian - thank you for review & confirming.
> >
> > Hi Clint,
> >
> > Thank you, I appreciate the response. I have responded inline with some q's, and I've also restated things in my own words to confirm that I understand ...
> >
> > > In the mid term, I think that some of us have been thinking that moving system tables into the Druid native query engine is the way to go, and have been working on resolving a number of hurdles that are required to make this happen.
> > > One of the main motivators to do this is so that we have just the Druid query path in the planner in the Calcite layer, deprecating and eventually dropping the "bindable" path completely, described in https://github.com/apache/druid/issues/9896. System tables would be pushed into Druid Datasource implementations, and queries would be handled in the native engine. Gian has even made a prototype of what this might look like, https://github.com/apache/druid/compare/master...gianm:sql-sys-table-native, since much of the groundwork is now in place, though it takes a hard-line approach of completely removing bindable instead of hiding it behind a flag, and doesn't implement all of the system tables yet, at least last time I looked at it.
> >
> > Looking over the changes, it seems that:
> > - a new VirtualDataSource is introduced, which the Druid non-SQL processing engine can process, and which can wrap an Iterable. This exposes a lazy segment & iterable using InlineDataSource.
> > - the SegmentsTable has been converted from a ScannableTable to a DruidTable, and a ScannableTableIterator is introduced to generate an iterable containing the rows; the new VirtualDataSource can be used to access the rows of this table.
> > - finally, the Bindable convention is discarded from DruidPlanner and the Rules.
> >
> > > I think there are a couple of remaining parts to resolve that would make this feasible. The first is that native scan queries need support for ordering by arbitrary columns, instead of just time, so that we can retain the capabilities of the existing system tables.
> >
> > It seems you want to use the native queries to support ordering; do you mean here the underlying SegmentsTable, or something in the Druid engine? Currently, the SegmentsTable etc. relies on, as you say, the bindable convention to provide the sort.
> > If it were a DruidTable, then it seems that the Sort gets pushed into PartialDruidQuery->DruidQuery, which conceptually is able to do a sort, but as described in [1] [2] the ordering is not supported by the underlying Druid engine [3].
> >
> > This would mean that an order by / sort / limit query would not be supported on any of the migrated sys.* tables until Druid has a way to perform the sort on a ScanQuery.
> >
> > [1] https://druid.apache.org/docs/latest/querying/scan-query.html#time-ordering
> > [2] https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/rel/DruidQuery.java#L1075-L1078
> > [3] https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/scan/ScanQueryEngine.java
> >
> > > This isn't actually a blocker for adding native system table queries, but rather a blocker for replacing the bindable convention by default, so that there isn't a loss (or rather a trade) of functionality. Additionally, I think there are maybe some matters regarding authorization of system tables when handled by the native engine that will need to be resolved, but this can be done while adding the native implementations.
> >
> > It looks like the port of the tables from the classic ScannableTable to a DruidTable is itself straightforward. However, it seems this PR doesn't bring them across from the SQL domain to be available in any native queries. I'm not sure if this is expected, an interim step, or if I have misunderstood the goal.
> >
> > > I think there are various ideas and experiments underway on how to do sorting on scan queries at normal Druid datasource scale, which is sort of a big project, but in the short term we might be able to do something less ambitious that works well enough at system-table scale to allow this plan to fully proceed.
> >
> > One possible way, that I think leads in the correct direction:
> > 1) We have an existing rule for a LogicalTable with a DruidTable to DruidQueryRel, which can eventually construct a DruidQuery.
> > 2) The VirtualDataSource, created during SQL parsing, takes an already-constructed Iterable; so we need to have already performed the filter/sort before creating the VirtualDataSource (and DruidQuery). This means the push-down filter logic has to happen during sql/ stage setup, before handoff to the processing/ engine.
> > 3) Perhaps a new VirtualDruidTable subclassing DruidTable, with a RelOptRule that can identify a LogicalXxx above a VirtualDruidTable and push it down? Then our SegmentsTable and friends can expose the correct Iterable. This should allow us to solve the perf concerns, and would allow us to present a correctly constructed VirtualDataSource. Sort from SQL _should_ be supported (I think), as the planner can push the sort etc. down to these nodes directly.
> >
> > In this approach, the majority of the work happens before the Druid engine, in sql/, so Druid core doesn't actually need to know anything about these changes.
> >
> > On the other hand, whilst it keeps the pathway open, I'm not sure this does any of the actual work to make the sys.* tables available as native tables. If we are to make these into truly native tables, without a native sort, and remove their implementation from sql/, the DruidQuery in the planner would need to be configured to pass the ScanQuery sort to the processing engine _but only for sys.* tables_, and then the processing engine would need to know how to find these tables. (I haven't explored this.) As you mention, implementing native sort across multiple data sources seems like a more ambitious piece of work.
> >
> > As another idea, we could consider creating a bridge Bindable/EnumerableToDruid rule that would allow Druid to embed these tables, moving them out of sql/ into processing/, exposed as Iterable/Enumerable, and making them available in native queries if that is a goal. I'm not really sure that adds anything to the overall goals, though.
> >
> > > Does this approach make sense? I don't believe Gian is actively working on this at the moment, so I think if you're interested in moving along this approach and want to start laying the groundwork, I'm happy to provide guidance and help out.
> >
> > I am interested. For my current work, I do want to keep focus on the sys.* performance work. If there's a way to do it and lay the groundwork or even get all the work done, then I am 100% for that. Looking at what you want to do to convert these sys.* to native tables, if we have a viable solution or are comfortable with my suggestions above, I'd be happy to build it out.
> >
> > Thanks
> > Jason
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@druid.apache.org
> > For additional commands, e-mail: dev-h...@druid.apache.org
> >