Jesus Camacho Rodriguez created HIVE-14468:
----------------------------------------------
Summary: Implement Druid query based input format
Key: HIVE-14468
URL: https://issues.apache.org/jira/browse/HIVE-14468
Project: Hive
Issue Type: Sub-task
Components: Druid integration
Affects Versions: 2.2.0
Reporter: Jesus Camacho Rodriguez
Assignee: Jesus Camacho Rodriguez
The input format is responsible for generating the splits and creating the record readers.
* For *Timeseries*, *TopN*, and *GroupBy* queries, we create a single split
containing the broker address and the query. The record reader then submits the
query to the broker, retrieves the results, and parses them to generate records.
* For *Select* queries, Druid has the concept of a threshold (limit), which is
used to retrieve the query results over multiple requests. Hence, we will first
issue a Druid Segment Metadata query to obtain the number of rows in the
datasource. Then we create _number of rows / default\_threshold_ splits, where
_default\_threshold_ is a Hive configuration property defined as
{{hive.druid.select.threshold}}. Each generated split contains the broker
address and a Select JSON query with a _start_ and _end_ row. The splits are
handled independently by the record readers, each of which submits its query to
the broker, retrieves the results, and parses them to generate records. This
way we can parallelize the retrieval of results for these queries.
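
The split generation described above could be sketched roughly as follows. This is only an illustration in Python, not the Hive implementation: {{DruidSplit}}, {{make_splits}}, and the {{start_row}} / {{end_row}} fields are hypothetical names, the row count is assumed to have already been obtained from the Segment Metadata query, and the division is rounded up so that a final partial range is still covered. How the row range is actually encoded in the Select JSON query is not specified here, so the sketch keeps it next to the query instead of inside it.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DruidSplit:
    """Hypothetical split: broker address, query JSON, optional row range."""
    broker_address: str
    query_json: str
    start_row: Optional[int] = None  # inclusive; only set for Select queries
    end_row: Optional[int] = None    # exclusive; only set for Select queries


def make_splits(broker_address: str,
                query: dict,
                num_rows: Optional[int] = None,
                threshold: Optional[int] = None) -> List[DruidSplit]:
    """Generate splits for a Druid query (illustrative sketch only)."""
    query_type = query["queryType"]
    query_json = json.dumps(query)

    # Timeseries, TopN, GroupBy: a single split carrying the whole query.
    if query_type in ("timeseries", "topN", "groupBy"):
        return [DruidSplit(broker_address, query_json)]

    if query_type != "select":
        raise ValueError("unsupported query type: %s" % query_type)
    if num_rows is None or threshold is None or threshold <= 0:
        raise ValueError("Select queries need num_rows and a positive threshold")

    # Assumption: round up so rows beyond the last full range are not dropped.
    num_splits = (num_rows + threshold - 1) // threshold

    splits = []
    for i in range(num_splits):
        start = i * threshold
        end = min(start + threshold, num_rows)
        splits.append(DruidSplit(broker_address, query_json, start, end))
    return splits
```

For example, with 25 rows and a threshold of 10, this sketch yields three splits covering rows 0-10, 10-20, and 20-25, each of which could be handed to its own record reader.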