Thanks everyone! Converting to partitioned Parquet really did the trick. I ended up partitioning by day and hour, and while the queries aren’t instantaneous, they aren’t slow either.
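For anyone who finds this thread later, the conversion was a plain CTAS along the lines of the sketch below. The workspace, table, and column names are illustrative rather than the real ones, and it assumes the timestamp is already exposed as a proper TIMESTAMP column through the view over the CSV files:

    -- Write Parquet (the default store.format, but being explicit doesn't hurt).
    ALTER SESSION SET `store.format` = 'parquet';

    -- One partition per day/hour combination; the partition columns
    -- must appear in the SELECT list when using CTAS ... PARTITION BY.
    CREATE TABLE dfs.tmp.`events_by_day_hour`
    PARTITION BY (event_date, event_hour)
    AS
    SELECT
      CAST(event_ts AS DATE)       AS event_date,
      EXTRACT(HOUR FROM event_ts)  AS event_hour,
      event_ts,
      col_a,
      col_b
    FROM dfs.tmp.`csv_events_view`;

The day/hour split mirrors the granularity of the time filters in my queries; as Kunal notes below, going more granular than the queries need mostly just produces lots of small files.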
> On Oct 30, 2017, at 14:21, Andries Engelbrecht <aengelbre...@mapr.com> wrote:
>
> As Ted touched on - CSV requires a lot of text parsing and data type
> conversion, while Parquet doesn’t. This typically saves a ton of CPU when you
> query, even if you can’t leverage partitioning or columnar advantages.
>
> --Andries
>
>
> On 10/30/17, 10:57 AM, "Saurabh Mahapatra" <saurabhmahapatr...@gmail.com>
> wrote:
>
> As Kunal points out, Parquet and partitioning should help. The concept of
> partitioning is not new to warehousing-type queries. To get performance,
> you aggregate and partition by the granularity of the queries you are
> running: for example, by day, by week, by month or even by quarter. The goal
> is to avoid on-the-fly aggregation, for speed.
>
> It is not always true that a columnar format will outperform a row-based
> format in every situation. If you are doing point-access analytics, where
> the selectivity of your filter is very high and you want access to all
> columns, row-based formats may actually do better.
>
> Data layout and query characteristics are the yin and yang of performance
> for any query engine on massive-scale data. They need to match the problem
> you are solving.
>
> Best,
> Saurabh
>
> On Mon, Oct 30, 2017 at 10:46 AM, Kunal Khatua <kkha...@mapr.com> wrote:
>
>> I second Ted's suggestion!
>>
>> Since we haven't seen your query profile's operator overview, we can't say
>> for sure why the performance isn't good.
>>
>> Off the top of my head, these are the most likely things making your
>> performance so bad:
>>
>> 1. All the CSV files are being read and rows rejected, because there is no
>> way for Drill to know which segments of data contain the time ranges you
>> are looking at.
>> 2. Your CSV data has many columns, but you only care about a few... the CSV
>> reader still needs to process the irrelevant ones too.
>> 3. There is a cost to reading and casting/converting the data into a
>> date/time format.
>>
>> So, as Ted suggested, writing the data out as Parquet will give you the
>> most bang for the buck.
>> Partitioning on, say, a date helps, but you also don't want it too
>> granular.
>>
>> Last but not least, if you are doing a query of the form
>> select X, Y, Z where time between <startTime> and <endTime>
>> you will benefit immensely from the data being sorted on that time field.
>>
>> Hope that helps.
>>
>> ~ Kunal
>>
>> -----Original Message-----
>> From: Ted Dunning [mailto:ted.dunn...@gmail.com]
>> Sent: Monday, October 30, 2017 9:34 AM
>> To: user <user@drill.apache.org>
>> Subject: Re: Drill performance question
>>
>> Also, on a practical note, Parquet will likely crush CSV on performance.
>> Columnar. Compressed. Binary. All that.
>>
>>
>> On Mon, Oct 30, 2017 at 9:30 AM, Saurabh Mahapatra <
>> saurabhmahapatr...@gmail.com> wrote:
>>
>>> Hi Charles,
>>>
>>> Can you share some query patterns on this data? More specifically, the
>>> number of columns you are retrieving out of the total, and the filter on
>>> the time dimension itself (ranges and granularities).
>>>
>>> How much is ad hoc and how much is not?
>>>
>>> Best,
>>> Saurabh
>>>
>>> On Mon, Oct 30, 2017 at 9:27 AM, Charles Givre <cgi...@gmail.com> wrote:
>>>
>>>> Hello all,
>>>> I have a dataset consisting of about 16 GB of CSV files. I am
>>>> looking to do some time series analysis of this data, and created a
>>>> view, but when I started doing aggregate queries using components of
>>>> the date, the performance was disappointing. Would it be better to
>>>> do a CTAS and partition by components of the date? If so, would
>>>> Parquet be the best format?
>>>> Would anyone have other suggestions of things I could do to improve
>>>> performance?
>>>> Thanks,
>>>> — C
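For completeness, the time-range query shape Kunal describes above ends up looking roughly like this against the partitioned table (again, names are illustrative only):

    -- The date predicate is logically redundant given the timestamp range,
    -- but it matches the partition column, so Drill can skip partitions that
    -- cannot contain qualifying rows instead of scanning everything.
    SELECT col_a, col_b
    FROM dfs.tmp.`events_by_day_hour`
    WHERE event_date = DATE '2017-10-05'
      AND event_ts BETWEEN TIMESTAMP '2017-10-05 00:00:00'
                       AND TIMESTAMP '2017-10-05 06:00:00';

Filtering only on the raw timestamp still returns the right rows, but spelling out the partition column in the WHERE clause is what gives the planner something to prune on.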