Hi all,

I am the architect of a PaaS BI solution and spent a week evaluating Apache 
Drill.

In our current solution, we store customer data in column files on S3 and 
perform in-memory computation on a home-made NoSQL engine. We have TBs of data 
on S3, but since it's a multi-tenant solution, when an end customer runs a 
query it only touches a subset of the data in a separate S3 folder, with, say, 
1 GB of data max.

We try to get the best possible response time for real-time analytics queries 
(e.g. 1 s response time for a 500K-row aggregation on 4 columns with joins). 
To do so, we load only the necessary columns (we have one file per column, not 
per table) and cache the column values in the local JVM for xx minutes.
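To make the idea concrete, here is a minimal sketch of that per-column TTL 
cache; all names are illustrative (not our actual code), and it assumes the 
column values fit in the JVM heap:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Minimal sketch of a per-column TTL cache (illustrative names only). */
public class ColumnCache {
    private static final long TTL_MILLIS = 10 * 60 * 1000; // e.g. a 10-min TTL

    private static final class Entry {
        final Object value;
        final long loadedAt;
        Entry(Object value, long loadedAt) {
            this.value = value;
            this.loadedAt = loadedAt;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();

    /** Returns the cached column, reloading (e.g. from S3) when expired. */
    public Object get(String columnKey, Supplier<Object> loader) {
        Entry e = cache.get(columnKey);
        long now = System.currentTimeMillis();
        if (e == null || now - e.loadedAt > TTL_MILLIS) {
            e = new Entry(loader.get(), now);
            cache.put(columnKey, e);
        }
        return e.value;
    }
}
```

The point is simply that only the first query per tenant pays the S3 download 
cost; later queries within the TTL window read from local memory.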

I built a POC with Drill & Parquet to replace our computation engine. Local 
execution times are fine and match our needs, and I am quite happy with the 
ability to query S3 with real SQL syntax. My main problem is the latency to S3 
(from AWS instances): for every query I have to pay the cost of downloading 
the Parquet file from S3, which makes the query response time too long for a 
"real-time" solution.

I would like to know whether you have planned anything on the roadmap to 
enable native caching of the data itself (not only metadata caching).

I saw the AbstractStoragePlugin and AbstractRecordReader classes. Would it be 
possible (and a good idea) for us to create a decorator for the classic file 
provider (or a totally new custom S3 provider) with in-memory cache 
capability? How would this fit into a Drill cluster and the Drill philosophy?
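In case it helps the discussion, here is the decorator shape I have in mind. 
The `FileFetcher` interface is purely hypothetical, standing in for however 
the S3 provider pulls file bytes; a real version would wrap Drill's actual 
file-system abstraction instead:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the S3 provider's byte-fetching layer. */
interface FileFetcher {
    byte[] fetch(String path);
}

/** Decorator that caches fetched files in memory, so only the first
 *  query pays the S3 download cost. */
class CachingFileFetcher implements FileFetcher {
    private final FileFetcher delegate; // e.g. the real S3 fetcher
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

    CachingFileFetcher(FileFetcher delegate) {
        this.delegate = delegate;
    }

    @Override
    public byte[] fetch(String path) {
        // Download once, serve subsequent queries from local memory
        return cache.computeIfAbsent(path, delegate::fetch);
    }
}
```

Obviously a production version would need eviction, size bounds, and some 
answer for cache locality across Drillbits, which is exactly the part I'm 
unsure fits the Drill philosophy.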

Thanks in advance


Jocelyn Demoy
BI Architect, R&D and strategy
Sage
