What is the most memory-efficient technique for selecting several million records from a CSV file

Gareth Western Thu, 22 Oct 2020 23:27:30 -0700

I have a very large CSV file (nearly 13 million records) stored in Azure 
Storage and read via the Azure Storage plugin. The drillbit is configured with 
a modest 4 GB heap. Is there an effective way to select all the records from 
the file without running out of resources in Drill?


SELECT * … is too big

SELECT * with OFFSET and LIMIT sounds like the right approach, but OFFSET still 
requires scanning through the offset records, and this seems to hit the same 
memory issues even with small LIMITs once the offset is large enough.
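
For illustration, the paged queries described above might look roughly like 
the sketch below. The storage plugin name `azure` and the file path are 
hypothetical placeholders, not taken from the actual setup, and the page size 
and offsets are arbitrary.

```sql
-- Hypothetical plugin name (azure) and file path; substitute your own.
-- First page: nothing has to be skipped.
SELECT *
FROM azure.`data/records.csv`
LIMIT 100000 OFFSET 0;

-- A much later page: the LIMIT is the same size, but the 12,000,000
-- preceding records still have to be read and discarded first.
SELECT *
FROM azure.`data/records.csv`
LIMIT 100000 OFFSET 12000000;
```

Each page restarts the scan from the top of the file, so the work per page 
grows with the offset even though the number of rows returned stays constant.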

Would it help to switch the format to something other than CSV? Or move it to a 
different storage mechanism? Or something else?

