gianm opened a new issue #5989: SQL: System tables for introspection
URL: https://github.com/apache/incubator-druid/issues/5989
 
 
   It is common among DBMSes to expose system information through special 
system tables. This practice helps people understand the system and gives them 
visibility into what it is doing. Druid already has a couple of these; see: 
http://druid.io/docs/latest/querying/sql.html#retrieving-metadata.
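
   For example, the existing metadata tables can already be queried like any 
other table. Per the linked docs, something like this lists the queryable 
datasources:

```sql
-- Query Druid's existing INFORMATION_SCHEMA (supported today).
SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'druid'
```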
   
   In addition to the INFORMATION_SCHEMA, I propose adding a "system" schema 
with the following tables:
   
   - `system.segments` -- a row per segment. It should include both segments 
from the metadata store _and_ segments that are served but not yet published 
(i.e. segments that are served from realtime tasks). It should include the 
DataSegment payload as one of its columns, plus other useful columns such as 
dataSource, start, end, is_published, is_available, and is_realtime.
   - `system.servers` -- a row per server. It should include host, port, and 
server type. I think at first it would be ok to only include data servers here 
(i.e. historicals and peons). Eventually we would want to include all server 
types.
   - `system.server_segments` -- a row per segment per server. The idea is that 
this is a table that can be used to join `segments` with `servers` (see the 
example query after this list).
   - `system.tasks` -- a row per task.
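
   To make that join concrete, here is a hypothetical query against the 
proposed tables showing where each segment of a datasource is being served. 
The join keys (`segment_id`, `server_id`) and exact column names are 
placeholders; the proposal above only pins down a rough column list:

```sql
-- Hypothetical: which servers are serving each segment of "wikipedia"?
-- Assumes segment_id / server_id join keys, which are not yet specified.
SELECT
  segments.datasource,
  segments."start",
  segments."end",
  segments.is_realtime,
  servers.host,
  servers.port
FROM system.segments AS segments
JOIN system.server_segments AS server_segments
  ON segments.segment_id = server_segments.segment_id
JOIN system.servers AS servers
  ON server_segments.server_id = servers.server_id
WHERE segments.datasource = 'wikipedia'
```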
   
   The segment-related tables would largely be meant as more user-friendly 
alternatives to, and eventual replacements for, the hodgepodge of REST APIs 
documented at http://druid.io/docs/latest/design/coordinator.html. The task 
table would likewise be an alternative to the task APIs on the overlord.
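
   As a sketch of what that could look like, assuming the `is_published` and 
`is_available` flags proposed above are exposed as 0/1 columns, a query like 
this could stand in for the coordinator's segment-listing endpoints:

```sql
-- Hypothetical: published segments that no server is serving yet,
-- per datasource. The 0/1 flag representation is an assumption.
SELECT datasource, COUNT(*) AS segments_waiting_to_load
FROM system.segments
WHERE is_published = 1 AND is_available = 0
GROUP BY datasource
```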
   
   For these tables to work, the information needs to be available on the 
broker. The broker already has some of this data pre-cached: served segments 
for `system.segments` are in one of its ServerViews, as are all the data 
servers for `system.servers` and all the server-segment mappings for 
`system.server_segments`. Other data the broker does _not_ know today: 
metadata segments for `system.segments` are only known to the coordinator, and 
tasks for `system.tasks` are only known to the overlord.
   
   For the data that the broker already has, the system tables should be wired 
up to those existing caches, similar to how INFORMATION_SCHEMA works today.
   
   For the data that the broker does _not_ have, we could either pull it in and 
pre-cache it, or pull it on demand. For metadata segments I could see this 
going either way. For tasks I think we should not pull and cache: task state 
changes frequently, and we don't want the data to be stale, since users will 
likely rely on this table being up-to-the-second. So for tasks we should pull 
"live" from the overlord on each query. There isn't an example of this being 
done in Druid today, but it's something Calcite is capable of. We should also 
endeavor to push down filters (such as dataSource or task type) where possible.
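
   For instance, in a hypothetical query like the one below, the broker could 
translate the `WHERE` clause into parameters on the overlord's task APIs 
rather than fetching every task and filtering locally. The `task_id`, `"type"`, 
and `status` columns are assumptions for illustration:

```sql
-- Hypothetical: filters on datasource and task type could be pushed
-- down to the overlord instead of being evaluated on the broker.
SELECT task_id, "type", datasource, status
FROM system.tasks
WHERE datasource = 'wikipedia' AND "type" = 'index_kafka'
```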
