Hi, I'm pretty new around here but let me attempt to answer you.
- Parquet will always be (a lot) faster than CSV, especially if you're querying only a subset of the columns - Parquet has various compression techniques and is more "scan friendly" (optimized for scanning compressed data)
- The optimal file size is linked to the filesystem segment sizes (I'm not sure how that affects S3) and block sizes - have a look at this: http://ingest.tips/2015/01/31/parquet-row-group-size/
- Read up on partitioning of Parquet files, which is supported by Drill and can improve your performance quite a bit - partitioning lets you filter data efficiently and prevents scanning of data that is not relevant to your query
- Spend a little bit of time planning how you will map your CSV to Parquet so that columns are imported with the appropriate data types - this matters for compression and efficiency (storing numbers as strings, for example, will prevent Parquet from doing some of its optimization magic) - a rough sketch follows at the end of this mail
- See this: http://www.slideshare.net/julienledem/th-210pledem?next_slideshow=2 (or some of the other presentations on Parquet)
- Optimize your drillbits (Drill machines) so they are sharing the workload
- Get to know S3 best practices
  - https://www.youtube.com/watch?v=_FHRzq7eHQc
  - https://aws.amazon.com/articles/1904

Hope this helps,
-Stefan

On Sat, Jul 25, 2015 at 9:08 AM, Hafiz Mujadid <hafizmujadi...@gmail.com> wrote:
> Hi!
>
> I have terabytes of data on S3 and I want to query this data using drill. I
> want to know at which format of data drill gives best performance. whether
> CSV format will be best or parquet format? Also what should be file size?
> whether small files will be more appropriate for drill or large files?
>
>
> Thanks
>
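P.S. To make the data-type and partitioning points concrete, here is a minimal sketch of a Drill CTAS that converts a CSV file into typed, partitioned Parquet. The workspace, path and column names (s3.`raw/events.csv`, dfs.tmp.`events_parquet`, event_date, user_id, amount) are made up for illustration, and the PARTITION BY clause assumes a Drill version with CTAS partitioning support (1.1+):

    -- Write Parquet output and (optionally) tune the row group size;
    -- 268435456 bytes = 256 MB, pick what fits your data and block sizes.
    ALTER SESSION SET `store.format` = 'parquet';
    ALTER SESSION SET `store.parquet.block-size` = 268435456;

    -- Hypothetical header-less CSV: Drill exposes the fields as the
    -- `columns` array, so cast each one to a proper type instead of
    -- leaving everything as VARCHAR.
    CREATE TABLE dfs.tmp.`events_parquet`
    PARTITION BY (event_date)
    AS SELECT
      CAST(columns[0] AS DATE)   AS event_date,
      CAST(columns[1] AS BIGINT) AS user_id,
      CAST(columns[2] AS DOUBLE) AS amount
    FROM s3.`raw/events.csv`;

Queries that filter on the partition column (event_date here) can then skip whole partitions instead of scanning everything, which is where a lot of the Parquet-over-CSV speedup on S3 comes from.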