As Ted touched on, CSV requires a lot of text parsing and data type 
conversion, while Parquet doesn't. This typically saves a ton of CPU when you 
query, even if you can't leverage partitioning or the columnar advantages.
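
To make that concrete, here is a rough illustration (paths and column names 
are made up, and the CSV is assumed to have no header row):

    -- Against raw CSV, every value arrives as text and has to be cast
    -- per query, row by row:
    SELECT CAST(columns[1] AS TIMESTAMP) AS event_time,
           CAST(columns[3] AS DOUBLE)    AS amount
    FROM   dfs.`/data/events_csv`;

    -- Against Parquet, the types are stored in the file, so there is no
    -- per-row text parsing:
    SELECT event_time, amount
    FROM   dfs.`/data/events_parquet`;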

--Andries



On 10/30/17, 10:57 AM, "Saurabh Mahapatra" <saurabhmahapatr...@gmail.com> wrote:

    As Kunal points out, Parquet and partitioning should help. The concept of
    partitioning is not new to warehousing-style queries. To get performance,
    you aggregate and partition at the granularity of the queries you expect
    to run: for example, by day, by week, by month, or even by quarter. The
    goal is to avoid on-the-fly aggregation, for speed; see the sketch below.
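    
    A rough sketch of what I mean (untested; the path and column names are
    placeholders, and it assumes the CSV storage format is configured with
    extractHeader so columns can be referenced by name):
    
        CREATE TABLE dfs.tmp.`events_by_day` AS
        SELECT CAST(event_time AS DATE)       AS event_day,
               event_type,
               COUNT(*)                       AS event_count,
               SUM(CAST(amount AS DOUBLE))    AS total_amount
        FROM   dfs.`/data/events_csv`
        GROUP BY CAST(event_time AS DATE), event_type;
    
    Daily-granularity queries then read the small pre-aggregated table
    instead of re-aggregating the raw rows on the fly.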
    
    It is not always true that a columnar format will outperform a row-based
    format. If you are doing point-access analytics, where the selectivity of
    your filter is very high and you want access to all columns, row-based
    formats may actually do better.
    
    Data layout and query characteristics are the yin and yang of performance
    for any query engine on massive-scale data. They need to match for the
    problem you are solving.
    
    Best,
    Saurabh
    
    On Mon, Oct 30, 2017 at 10:46 AM, Kunal Khatua <kkha...@mapr.com> wrote:
    
    > I second Ted's suggestion!
    >
    > Since we haven't seen your profile's operator overview, we can't say
    > for sure why the performance isn't good.
    >
    > Off the top of my head, these are the most likely things making your
    > performance so bad:
    >
    > 1. All the CSV files are being read and rows rejected, because there is
    > no way for Drill to know which segments of data have the relevant time
    > ranges you are looking at.
    > 2. Your CSV data has many columns, but you only care about a few... CSV
    > readers will need to process the irrelevant ones too.
    > 3. There is a cost to reading and casting/converting the data into a
    > date/time format.
    >
    > So, as Ted suggested, writing the data out as Parquet will give you the
    > most bang for the buck; see the sketch below.
    > Partitioning on, say, the date helps, but you also don't want it too
    > granular.
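    >
    > Something like this (an untested sketch; the paths and column names are
    > placeholders, and the CSV is assumed to have no header row):
    >
    >     CREATE TABLE dfs.tmp.`events_parquet`
    >     PARTITION BY (event_date)
    >     AS
    >     SELECT CAST(columns[0] AS DATE)      AS event_date,
    >            CAST(columns[1] AS TIMESTAMP) AS event_time,
    >            columns[2]                    AS event_type,
    >            CAST(columns[3] AS DOUBLE)    AS amount
    >     FROM   dfs.`/data/events_csv`;
    >
    > Parquet is Drill's default CTAS output format, and PARTITION BY lets the
    > planner prune whole files when you filter on event_date.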
    >
    > Last but not least, if you are doing a query of the form
    >     SELECT X, Y, Z FROM ... WHERE time BETWEEN <startTime> AND <endTime>
    > you will benefit immensely from the data being sorted on that time
    > field.
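    >
    > For example (again just a sketch, reusing the made-up names from above):
    >
    >     CREATE TABLE dfs.tmp.`events_sorted` AS
    >     SELECT event_date, event_time, event_type, amount
    >     FROM   dfs.tmp.`events_parquet`
    >     ORDER BY event_time;
    >
    >     SELECT event_type, amount
    >     FROM   dfs.tmp.`events_sorted`
    >     WHERE  event_time BETWEEN TIMESTAMP '2017-10-01 00:00:00'
    >                           AND TIMESTAMP '2017-10-07 23:59:59';
    >
    > Sorting keeps the min/max statistics in each Parquet row group tight, so
    > a range filter like this can skip most of the data.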
    >
    > Hope that helps.
    >
    > ~ Kunal
    >
    > -----Original Message-----
    > From: Ted Dunning [mailto:ted.dunn...@gmail.com]
    > Sent: Monday, October 30, 2017 9:34 AM
    > To: user <user@drill.apache.org>
    > Subject: Re: Drill performance question
    >
    > Also, on a practical note, Parquet will likely crush CSV on performance.
    > Columnar. Compressed. Binary.  All that.
    >
    >
    >
    > On Mon, Oct 30, 2017 at 9:30 AM, Saurabh Mahapatra <
    > saurabhmahapatr...@gmail.com> wrote:
    >
    > > Hi Charles,
    > >
    > > Can you share some query patterns on this data? More specifically, the
    > > number of columns you are retrieving out of the total, and the filter
    > > on the time dimension itself (ranges and granularities).
    > >
    > > How much is ad hoc, and how much is not?
    > >
    > > Best,
    > > Saurabh
    > >
    > > On Mon, Oct 30, 2017 at 9:27 AM, Charles Givre <cgi...@gmail.com> wrote:
    > >
    > > > Hello all,
    > > > I have a dataset consisting of about 16 GB of CSV files.  I am
    > > > looking to do some time series analysis of this data, and created a
    > > > view but when I started doing aggregate queries using components of
    > > > the date, the performance was disappointing.  Would it be better to
    > > > do a CTAS and partition by components of the date?  If so, would
    > > > parquet be the best format?
    > > > Would anyone have other suggestions of things I could do to improve
    > > > performance?
    > > > Thanks,
    > > > — C
    > >
    >
    
