Hi,

I'm pretty new around here, but let me attempt to answer you.

   - Parquet will almost always be (a lot) faster than CSV, especially if
   you're querying only a subset of the columns (sketch below)
   - Parquet has various compression techniques and is more "scan
   friendly" (optimized for scanning compressed data)

   - The optimal file size is linked to the file system's segment and block
   sizes (I'm not sure how that affects S3)
   - have a look at this:
   http://ingest.tips/2015/01/31/parquet-row-group-size/
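
   If you end up converting the CSV to Parquet with Drill itself, Drill
   has a session option for the Parquet block (row group) size of the
   files it writes; the 128 MB value below is just an example:

      -- Sets the Parquet row group size for CTAS output to 128 MB,
      -- roughly matching a common HDFS block size.
      ALTER SESSION SET `store.parquet.block-size` = 134217728;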

   - Read up on partitioning of Parquet files, which is supported by Drill
   and can improve your performance quite a bit
   - partitioning helps you filter data efficiently and prevents Drill from
   scanning data that isn't relevant to your query (example below)
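
   For example (the layout and paths are made up): if you arrange your
   files in subdirectories like /data/sales/2015/07/, Drill exposes the
   directory levels as the pseudo-columns dir0, dir1, ... and can skip
   whole directories that don't match your filter:

      -- Only the 2015/07 subtree gets scanned; the rest is pruned away.
      SELECT *
      FROM s3.`/data/sales`
      WHERE dir0 = '2015' AND dir1 = '07';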

   - Spend a little bit of time planning how you will map your CSV to
   Parquet, to make sure columns are imported as the appropriate data type
   - this matters for compression and efficiency (storing numbers as
   strings, for example, will prevent Parquet from doing some of its
   optimization magic); see the CTAS sketch below
   - See this:
   http://www.slideshare.net/julienledem/th-210pledem?next_slideshow=2 (or
   some of the other presentations on Parquet)
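
   As a concrete (made-up) sketch of that mapping: Drill reads each CSV
   record as a varchar array called `columns`, so a CTAS with explicit
   casts is the usual way to get proper types into Parquet. The
   workspaces and columns here are hypothetical, and the target
   workspace has to be writable:

      -- Casting here means Parquet stores real numeric types
      -- instead of strings.
      CREATE TABLE s3.tmp.`sales_parquet` AS
      SELECT
        CAST(columns[0] AS INT)    AS user_id,
        CAST(columns[1] AS DOUBLE) AS amount,
        columns[2]                 AS country
      FROM s3.`/data/sales_csv`;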

   - Optimize your drillbits (Drill machines) so they share the workload
   (quick check below)
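
   A quick sanity check that all your drillbits are up and registered
   with the cluster is to query Drill's system tables:

      -- Lists every drillbit currently registered (host and ports).
      SELECT * FROM sys.drillbits;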

   - Get to know S3 best practices
   - https://www.youtube.com/watch?v=_FHRzq7eHQc
   - https://aws.amazon.com/articles/1904

Hope this helps,
 -Stefan

On Sat, Jul 25, 2015 at 9:08 AM, Hafiz Mujadid <hafizmujadi...@gmail.com>
wrote:

> Hi!
>
> I have terabytes of data on S3 and I want to query this data using Drill.
> I want to know which data format gives the best performance with Drill:
> will CSV or Parquet be best? Also, what should the file size be? Are
> small files or large files more appropriate for Drill?
>
>
> Thanks
>
