Thanks everyone!
Converting to partitioned Parquet really did the trick. I ended up
partitioning by day and hour, and while the queries aren’t instantaneous,
they aren’t slow either.
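
For reference, the CTAS was roughly of this shape (the table and column names
below are illustrative, not my real ones; it also assumes ISO-style timestamp
strings and that column headers are extracted from the CSV):

-- Parquet is the default CTAS output format; set explicitly here for clarity
ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.tmp.`events_parquet`
PARTITION BY (event_date, event_hour)
AS SELECT
    CAST(event_time AS TIMESTAMP)                    AS event_ts,
    CAST(CAST(event_time AS TIMESTAMP) AS DATE)      AS event_date,
    EXTRACT(HOUR FROM CAST(event_time AS TIMESTAMP)) AS event_hour,
    col_a,
    col_b
FROM dfs.`/data/csv/events`;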


> On Oct 30, 2017, at 14:21, Andries Engelbrecht <aengelbre...@mapr.com> wrote:
> 
> As Ted touched on - CSV requires a lot of text parsing and data type 
> conversion, while parquet doesn’t. This typically saves a ton of CPU when you 
> query, even if you can’t leverage partitioning or columnar advantages.
> 
> --Andries
> 
> 
> 
> On 10/30/17, 10:57 AM, "Saurabh Mahapatra" <saurabhmahapatr...@gmail.com> 
> wrote:
> 
>    As Kunal points out, Parquet and partitioning should help. The concept of
>    partitioning is not new to warehousing-style queries. To get performance,
>    you aggregate and partition at the granularity of the queries you are
>    running: by day, by week, by month, or even by quarter. The goal is to
>    avoid on-the-fly aggregation, for speed.
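> 
>    Purely as an illustration (table and column names made up), the idea is a
>    rollup you build once and then query, instead of aggregating the raw rows
>    every time:
> 
>    CREATE TABLE dfs.tmp.`sales_by_day`
>    PARTITION BY (sale_date)
>    AS SELECT
>        CAST(sale_time AS DATE) AS sale_date,
>        SUM(amount)             AS daily_amount,
>        COUNT(*)                AS row_cnt
>    FROM dfs.tmp.`sales_parquet`
>    GROUP BY CAST(sale_time AS DATE);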
> 
>    It is not always true that a columnar format will outperform a row-based
>    format. If you are doing point-access analytics, where your filter is
>    extremely selective and you want access to all columns, row-based formats
>    may actually do better.
> 
>    Data layout and query characteristics are the yin and yang of performance
>    for any query engine on massive-scale data. The two need to match for the
>    problem you are solving.
> 
>    Best,
>    Saurabh
> 
>    On Mon, Oct 30, 2017 at 10:46 AM, Kunal Khatua <kkha...@mapr.com> wrote:
> 
>> I second Ted's suggestion!
>> 
>> Since we haven't seen your query profile's operator overview, we can't say
>> for sure why the performance is poor.
>> 
>> Off the top of my head, these are the most likely things that are making
>> your performance so bad:
>> 
>> 1. All the CSV files are being read and their rows rejected, because there
>> is no way for Drill to know which segments of the data contain the time
>> ranges that you are looking at.
>> 2. Your CSV data has many columns, but you only care about a few; the CSV
>> reader still has to parse the irrelevant ones too.
>> 3. There is a cost to reading and casting/converting the data into a
>> date/time format.
>> 
>> So, as Ted suggested, writing the data out as Parquet will give you the
>> most bang for the buck.
>> Partitioning on, say, a date helps, but you also don't want it to be too
>> granular.
>> 
>> Last but not least, if you are running a query of the form
>> select X, Y, Z where time between <startTime> and <endTime>
>> you will benefit immensely from the data being sorted on that time field.
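>> 
>> A rough sketch of that idea (table and column names here are made up, and
>> it assumes the sort order from the CTAS carries through to the files):
>> 
>> -- write a Parquet copy sorted on the time column
>> CREATE TABLE dfs.tmp.`events_sorted` AS
>> SELECT X, Y, Z, event_ts
>> FROM dfs.tmp.`events_parquet`
>> ORDER BY event_ts;
>> 
>> -- the range query from above, against the sorted copy
>> SELECT X, Y, Z
>> FROM dfs.tmp.`events_sorted`
>> WHERE event_ts BETWEEN TIMESTAMP '2017-10-01 00:00:00'
>>                    AND TIMESTAMP '2017-10-02 00:00:00';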
>> 
>> Hope that helps.
>> 
>> ~ Kunal
>> 
>> -----Original Message-----
>> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
>> Sent: Monday, October 30, 2017 9:34 AM
>> To: user <user@drill.apache.org>
>> Subject: Re: Drill performance question
>> 
>> Also, on a practical note, Parquet will likely crush CSV on performance.
>> Columnar. Compressed. Binary.  All that.
>> 
>> 
>> 
>> On Mon, Oct 30, 2017 at 9:30 AM, Saurabh Mahapatra <
>> saurabhmahapatr...@gmail.com> wrote:
>> 
>>> Hi Charles,
>>> 
>>> Can you share some query patterns on this data? More specifically, the
>>> number of columns you are retrieving out of the total, and the filters on
>>> the time dimension itself (ranges and granularities)?
>>> 
>>> How much of it is ad hoc, and how much is not?
>>> 
>>> Best,
>>> Saurabh
>>> 
>>> On Mon, Oct 30, 2017 at 9:27 AM, Charles Givre <cgi...@gmail.com> wrote:
>>> 
>>>> Hello all,
>>>> I have a dataset consisting of about 16 GB of CSV files. I am looking to
>>>> do some time-series analysis of this data and created a view, but when I
>>>> started running aggregate queries using components of the date, the
>>>> performance was disappointing. Would it be better to do a CTAS and
>>>> partition by components of the date? If so, would Parquet be the best
>>>> format? Would anyone have other suggestions for things I could do to
>>>> improve performance?
>>>> Thanks,
>>>> — C
>>> 
>> 
> 
> 
