Hi Gian,

It looks like I have a solution that works, and I'd like to present it for
your thoughts.

https://github.com/apache/druid/compare/master...jasonk000:segment-table-interim?diff=unified

At runtime, the DruidQuery attempts to construct a ScanQuery. To do so, it
checks with the underlying datasource whether a scan is supported, and if
it is, it constructs a matching ScanQuery and passes it on. Later,
ScanQueryEngine checks whether the adapter is a SortedCursorFactory; if it
is, it requests the cursors to be constructed and passes all relevant
sort/limit/filter through to the underlying table logic. The VirtualTable
implementation can then perform its magic and return a result!
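To make the dispatch concrete, here is a toy sketch of that flow. The names
(SortedCursorFactory, the engine's instanceof check) follow the branch, but
the signatures are invented for illustration and are nothing like the real
Druid interfaces, which carry cursors, intervals, and column selectors:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model only: real Druid cursor factories do not return List<Integer>.
interface CursorFactory {
    List<Integer> scan(); // plain scan in existing order
}

// Capability interface: the factory applies sort/limit itself.
interface SortedCursorFactory extends CursorFactory {
    List<Integer> scanSorted(Comparator<Integer> order, long limit);
}

class ToyScanEngine {
    // Mirrors the ScanQueryEngine decision described above: push the
    // sort/limit down when the adapter supports it, else plain scan.
    static List<Integer> run(CursorFactory factory,
                             Comparator<Integer> order, long limit) {
        if (factory instanceof SortedCursorFactory) {
            return ((SortedCursorFactory) factory).scanSorted(order, limit);
        }
        return factory.scan(); // caller must handle ordering itself
    }
}

// Stand-in for a VirtualTable-backed adapter that can sort its own rows.
class ToyVirtualTable implements SortedCursorFactory {
    private final List<Integer> rows;
    ToyVirtualTable(List<Integer> rows) { this.rows = rows; }

    @Override public List<Integer> scan() { return rows; }

    @Override public List<Integer> scanSorted(Comparator<Integer> order,
                                              long limit) {
        List<Integer> sorted = new ArrayList<>(rows);
        sorted.sort(order);
        return sorted.subList(0, (int) Math.min(limit, sorted.size()));
    }
}
```

The point is just that the engine stays ignorant of how the table sorts; it
only tests for the capability and delegates.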

The solution is:
- Create a new VirtualDataSource, a new VirtualSegment and
VirtualStorageAdapter, as well as a new SortedCursorFactory to support the
VirtualDataSource.
- System tables get a DruidVirtualTableRule that converts a
LogicalTableScan of a VirtualTable into a PartialDruidQuery while passing
along the query authentication information.
- DataSource gets a method canScanOrdered so that it can advise the query
parser whether it can handle a given sort; the default is that only the
existing time-ordering is supported. Implementations (e.g.
VirtualDataSource) can decide whether to accept an ordered scan.
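The third point is the smallest change, so here is roughly the shape I have
in mind. This is a hedged sketch: the method name follows the branch, but
the signature is simplified and not the actual DataSource interface:

```java
import java.util.List;

// Sketch of the capability check on DataSource; signature simplified.
interface DataSource {
    // Default: only a scan in the existing (time) order is supported,
    // i.e. an empty list of requested order-by columns.
    default boolean canScanOrdered(List<String> orderByColumns) {
        return orderByColumns.isEmpty();
    }
}

// A virtual (system-table-backed) datasource can accept any requested
// ordering, because its own table logic performs the sort.
class VirtualDataSource implements DataSource {
    @Override
    public boolean canScanOrdered(List<String> orderByColumns) {
        return true;
    }
}
```

Existing datasources keep the conservative default, so the query parser's
behavior is unchanged unless an implementation opts in.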

Overall, I think this provides a viable solution and meets your goals.
While rebasing this morning I saw you're on the same pathway with the
ordered scan query, so I could rebase on top of that and break this into a
smaller set of PRs. Regardless, I think the conceptual approach and
direction will work.

Thanks!
Jason

On Wed, May 19, 2021 at 9:54 PM Gian Merlino <g...@apache.org> wrote:

> Hey Jason,
>
> It sounds like we have two different, but related goals:
>
> 1) Your goal is to improve the performance of system tables.
>
> 2) My goal with the branch Clint linked is to enable using Druid's native
> query engine for system tables, in order to achieve consistency in how SQL
> queries are executed and also so all of Druid's special functions,
> aggregations, extensions, etc, are available to use in system tables.
>
> Two notes about what I'm trying to do, btw, in response to things you
> raised. First, I'm interested in using Druid's native query engine under
> the hood, but I'm not necessarily driving towards being able to actually
> query system tables using native Druid queries. It still achieves my goals
> if these tables are only available in SQL queries. Second, you're correct
> that for this to work for queries with 'order by' but no 'group by', we'd
> need to add ordering support to the Scan query.
>
> That's actually the main reason I stopped working on this branch: I started
> looking into Scan ordering instead. Then I got distracted with other stuff,
> and now I'm working on neither of those things 🙂. Anyway, I think it'll be
> somewhat involved to implement Scan ordering in a scalable way for any
> possible query on any possible datasource, but if we're focused on sys
> tables, we can take a shortcut that is less-scalable. It wouldn't be tough
> to make something that works for anything that works today, since the
> bindable convention we use today simply does the sort in memory (see
> org.apache.calcite.interpreter.SortNode). That might be a good way to
> unblock the sys-table-via-native-engine work. We'd just need some safeguard
> to prevent that code from executing on real datasources that are too big to
> materialize. Perhaps a row limit, or perhaps enabling in-memory ordering
> using an undocumented Scan context parameter set by the SQL layer only for
> sys tables.
>
> > I am interested. For my current work, I do want to keep focus on the
> > sys.* performance work. If there's a way to do it and lay the
> > groundwork or even get all the work done, then I am 100% for that.
> >  Looking at what you want to do to convert these sys.* to native
> > tables, if we have a viable solution or are comfortable with my
> > suggestions above I'd be happy to build it out.
>
> I think the plan you laid out makes sense, especially part (3). Using a
> VirtualDruidTable that 'represents' a system table, but hasn't yet actually
> built an Iterable, would allow us to defer creation of the
> VirtualDataSource til later in the planning process. That'll enable more
> kinds of pushdown via rules, like you mentioned. Coupling that with an
> in-memory sort for the Scan query — appropriately guarded — I think would
> get us all the way there. Later on, as a separate project, we'll want to
> extend the Scan ordering to scale beyond what can be done in memory, but
> that wouldn't be necessary for the system tables.
>
> Btw, I would suggest not removing the bindable stuff completely. I did that
> in my branch, but I regret it, since I think the change is risky enough
> that it should be an option that is opt-in for a while. We could rip out
> the old stuff after a few releases, once we're happy with the stability of
> the new mechanism.
>
> What do you think?
>
> On Fri, May 14, 2021 at 2:51 PM Jason Koch <jk...@netflix.com.invalid>
> wrote:
>
> > @Julian - thank you for review & confirming.
> >
> > Hi Clint
> >
> > Thank you, I appreciate the response. I have responded inline with
> > some questions; I've also restated things in my own words to confirm
> > my understanding ...
> >
> > > In the mid term, I think that some of us have been thinking that moving
> > > system tables into the Druid native query engine is the way to go, and
> > have
> > > been working on resolving a number of hurdles that are required to make
> > > this happen. One of the main motivators to do this is so that we have
> > just
> > > the Druid query path in the planner in the Calcite layer, and
> deprecating
> > > and eventually dropping the "bindable" path completely, described in
> > > https://github.com/apache/druid/issues/9896. System tables would be
> > pushed
> > > into Druid Datasource implementations, and queries would be handled in
> > the
> > > native engine. Gian has even made a prototype of what this might look
> > like,
> > >
> >
> https://github.com/apache/druid/compare/master...gianm:sql-sys-table-native
> > > since much of the ground work is now in place, though it takes a
> > hard-line
> > > approach of completely removing bindable instead of hiding it behind a
> > > flag, and doesn't implement all of the system tables yet, at least last
> > > time I looked at it.
> >
> > Looking over the changes it seems that:
> > - a new VirtualDataSource is introduced, which the Druid non-sql
> > processing engine can process, that can wrap an Iterable. This exposes
> > lazy segment & iterable using  InlineDataSource.
> > - the SegmentsTable has been converted from a ScannableTable to a
> > DruidTable, and a ScannableTableIterator is introduced to generate an
> > iterable containing the rows; the new VirtualDataSource can be used to
> > access the rows of this table.
> > - finally, the Bindable convention is discarded from DruidPlanner and
> > Rules.
> >
> > > I think there are a couple of remaining parts to resolve that would
> make
> > > this feasible. The first is native scan queries need support for
> ordering
> > > by arbitrary columns, instead of just time, so that we can retain
> > > capabilities of the existing system tables.
> >
> > It seems you want to use the native queries to support ordering; do
> > you mean here the underlying SegmentsTable, or something in the Druid
> > engine? Currently, the SegmentsTable etc relies on, as you say, the
> > bindable convention to provide sort. If it was a DruidTable then it
> > seems that Sorting gets pushed into PartialDruidQuery->DruidQuery,
> > which conceptually is able to do a sort, but as described in [1] [2]
> > the ordering is not supported by the underlying druid engine [3].
> >
> > This would mean that an order by, sort, limit query would not be
> > supported on any of the migrated sys.* tables until Druid has a way to
> > perform the sort on a ScanQuery.
> >
> > [1]
> >
> https://druid.apache.org/docs/latest/querying/scan-query.html#time-ordering
> > [2]
> >
> https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/rel/DruidQuery.java#L1075-L1078
> > [3]
> >
> https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/scan/ScanQueryEngine.java
> >
> > > This isn't actually a blocker
> > > for adding native system table queries, but rather a blocker for
> > replacing
> > > the bindable convention by default so that there isn't a loss (or
> rather
> > > trade) of functionality. Additionally, I think there is maybe some
> > matters
> > > regarding authorization of system tables when handled by the native
> > engine
> > > that will need resolved, but this can be done while adding the native
> > > implementations.
> >
> > It looks like the port of the tables from classic ScannableTable to a
> > DruidTable itself is straightforward. However, it seems this PR
> > doesn't bring them across from SQL domain to be available in any
> > native queries. I'm not sure if this is expected or an interim step or
> > if I have misunderstood the goal.
> >
> > > I think there are some various ideas and experiments underway of how to
> > do
> > > sorting on scan queries at normal Druid datasource scale, which is sort
> > of
> > > a big project, but in the short term we might be able to do something
> > less
> > > ambitious that works well enough at system tables scale to allow this
> > plan
> > > to fully proceed.
> >
> > One possible way, that I think leads in the correct direction:
> > 1) We have an existing rule for LogicalTable with DruidTable to
> > DruidQueryRel which can eventually construct a DruidQuery.
> > 2) The VirtualDataSource, created during SQL parsing takes an
> > already-constructed Iterable; so, we need to have already performed
> > the filter/sort before creating the VirtualDataSource (and
> > DruidQuery). This means the push-down filter logic has to happen
> > during sql/ stage setup and before handoff to processing/ engine.
> > 3) Perhaps a new VirtualDruidTable subclassing DruidTable w/ a
> > RelOptRule that can identify LogicalXxx above a VirtualDruidTable and
> > push down? Then, our SegmentTable and friends can expose the correct
> > Iterable. This should allow us to solve the perf concerns, and would
> > allow us to present a correctly constructed VirtualDataSource. Sort
> > from SQL _should_ be supported (I think) as the planner can push the
> > sort etc down to these nodes directly.
> >
> > In this, the majority of the work would have had to have happened
> > prior to Druid engine, in sql/, before reaching Druid and so Druid
> > core doesn't actually need to know anything about these changes.
> >
> > On the other hand, whilst it keeps the pathway open, I'm not sure this
> > does any of the actual work to make the sys.* tables available as
> > native tables. If we are to try and make these into truly native
> > tables, without a native sort, and remove their implementation from
> > sql/, the DruidQuery in the planner would need to be configured to
> > pass the ScanQuery sort to the processing engine _but only for sys.*
> > tables_ and then processing engine would need to know how to find
> > these tables. (I haven't explored this). As you mention, implementing
> > native sort across multiple data sources seems like a more ambitious
> > piece of work.
> >
> > As another idea, we could consider creating a bridge
> > Bindable/EnumerableToDruid rule that would allow druid to embed these
> > tables, and move them out of sql/ into processing/, exposed as
> > Iterable/Enumerable, and make them available in queries if that is a
> > goal. I'm not really sure that adds anything to the overall goals
> > though.
> >
> > > Does this approach make sense? I don't believe Gian is actively working
> > on
> > > this at the moment, so I think if you're interested in moving along
> this
> > > approach and want to start laying the groundwork I'm happy to provide
> > > guidance and help out.
> > >
> >
> > I am interested. For my current work, I do want to keep focus on the
> > sys.* performance work. If there's a way to do it and lay the
> > groundwork or even get all the work done, then I am 100% for that.
> > Looking at what you want to do to convert these sys.* to native
> > tables, if we have a viable solution or are comfortable with my
> > suggestions above I'd be happy to build it out.
> >
> > Thanks
> > Jason
> >
> >
> >
>
