Hi,
As the project description says, I understand Drill to be an open source
implementation of Dremel. Basically, Dremel optimizes ad hoc queries on
unstructured data by storing it in a columnar layout instead of record by
record. I assume Drill does the same. I see that Drill supports a wide
variety of data sources such as JSON, MongoDB, etc. How does Drill
transform the source data into a columnar representation so that it can
optimize the queries?
For example:
Data [assume it is stored in MongoDB]:
{"idtype":"ca","id":3,"metric":"purchases","time":"Y14/M0/D0","device":"nexus","devicegrp":"tablet","source":"minewhat","sourcegrp":"email","dofw":"weekend","tofd":"morning","browser":"chrome","engage":"return","location":"mumbai","locationgrp":"maharashtra","usertag":"frequent","search":"sony tab","total":56263}
And for a query like below:
select test.device, count(*) from mongo.mydata test where test.idtype='b'
and test.id=10 group by test.device, test.idtype, test.id;
Will Drill load *all documents* from the mydata collection every time this
query is fired and only then map the data to a columnar layout? I'm 100%
sure this can't be the implementation, as it looks like it would make
things worse [load the data, transform it (which has to go row by row),
and then query the transformed data].
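To make the concern concrete, here is a minimal Python sketch of the naive
load-then-pivot approach I am describing. It is only my assumption of what
such a design would look like, not Drill's actual code; the function names
to_columns and count_by_device and the sample documents are made up for
illustration.

```python
def to_columns(documents):
    """Pivot row-wise documents into one list per field (a toy columnar layout)."""
    # Collect every field name, in first-seen order, across all documents.
    fields = []
    for doc in documents:
        for name in doc:
            if name not in fields:
                fields.append(name)
    # Each column holds one value per document; missing fields become None.
    return {name: [doc.get(name) for doc in documents] for name in fields}


def count_by_device(columns, idtype, id_value):
    """Count matching rows per device, reading only the three columns needed."""
    counts = {}
    rows = zip(columns["idtype"], columns["id"], columns["device"])
    for row_idtype, row_id, device in rows:
        if row_idtype == idtype and row_id == id_value:
            counts[device] = counts.get(device, 0) + 1
    return counts


# Hypothetical documents standing in for the mydata collection.
documents = [
    {"idtype": "b", "id": 10, "device": "nexus", "total": 5},
    {"idtype": "b", "id": 10, "device": "ipad", "total": 7},
    {"idtype": "b", "id": 10, "device": "nexus", "total": 1},
    {"idtype": "ca", "id": 3, "device": "nexus", "total": 56263},
]

columns = to_columns(documents)            # full scan plus row-by-row pivot
result = count_by_device(columns, "b", 10)
print(result)
```

The query itself reads only three columns, but every document still has to
be fetched and pivoted first, which is the cost I am worried about.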
It would be really helpful if someone could shed some light on this area, as
I could not find any material on it in the documentation.
Regards,
Tamil.s