Re: [HANGOUT] Topics for 10/31/2017

2017-10-30 Thread Timothy Farkas
I'll speak about unit testing:


 - Common mistakes made in unit tests

 - The soon-to-be-merged temp directory test watcher classes

 - Making our Travis build run smoke tests with code coverage

 - Other misc improvements to be made

Thanks,
Tim



Re: Drill performance question

2017-10-30 Thread Andries Engelbrecht
As Ted touched on, CSV requires a lot of text parsing and data type
conversion, while Parquet doesn't. That typically saves a ton of CPU at query
time, even if you can't leverage partitioning or columnar advantages.
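
A rough sketch of what that one-time conversion could look like (purely
illustrative; the path and column names are placeholders, and it assumes the
CSV files have headers with extractHeader enabled):

ALTER SESSION SET `store.format` = 'parquet';

CREATE TABLE dfs.tmp.`my_data_parquet` AS
SELECT CAST(`date` AS DATE) AS `date`,  -- pay the parse/cast cost once, at write time
       user_id, city, state
FROM dfs.`/path/to/csv`;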

--Andries



Re: Drill performance question

2017-10-30 Thread Saurabh Mahapatra
As Kunal points out, Parquet and partitioning should help. The concept of
partitioning is not new to warehousing-style queries. To get performance, you
aggregate and partition at the granularity your queries need: by day, by week,
by month, or even by quarter. The goal is to avoid on-the-fly aggregation at
query time.
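
For illustration only (the table and column names below are placeholders, not
from this thread), a daily rollup materialized once with CTAS lets later
queries hit pre-aggregated rows instead of the raw data:

CREATE TABLE dfs.tmp.`events_by_day` AS
SELECT CAST(`date` AS DATE)    AS event_day,
       COUNT(*)                AS total_rows,
       COUNT(DISTINCT user_id) AS distinct_users
FROM dfs.tmp.`my_data_parquet`  -- placeholder source table
GROUP BY CAST(`date` AS DATE);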

A columnar format will not outperform a row-based format in every situation.
If you are doing point-access analytics, where your filter is very selective
and you want access to all columns, row-based formats may actually do better.
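
Purely as an illustration of that access pattern (placeholder names again), a
lookup like the one below is highly selective and needs every column, which is
where a row-oriented layout can win:

SELECT *
FROM dfs.tmp.`my_data_parquet`
WHERE user_id = '12345'
  AND `date` = DATE '2017-10-01';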

Data layout and query characteristics are the yin and yang of performance for
any query engine on massive-scale data. They need to match the problem you are
solving.

Best,
Saurabh


RE: Drill performance question

2017-10-30 Thread Kunal Khatua
I second Ted's suggestion!

Since we haven't seen your query profile's operator overview, we can't say for
sure why the performance isn't good.

Off the top of my head, these are the most likely things making your
performance so bad:

1. All the CSV files are being read and their rows rejected, because there is
no way for Drill to know which segments of data contain the time ranges you
are looking at.
2. Your CSV data has many columns, but you only care about a few; the CSV
reader still has to process the irrelevant ones.
3. There is a cost to reading and casting/converting the data into a date/time
format.

So, as Ted suggested, writing the data as a Parquet file will give you the most
bang for the buck.
Partitioning on, say, a date helps, but you also don't want it too granular.

Last but not least, if you are doing a query of the form
select X, Y, Z where time between <start> and <end>
you will benefit immensely from the data being sorted on that time field.
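
A hedged sketch of that combination (names are placeholders; it assumes CTAS
PARTITION BY and an ORDER BY inside the CTAS behave as expected on your Drill
version):

CREATE TABLE dfs.tmp.`events_partitioned`
PARTITION BY (event_date) AS
SELECT CAST(`date` AS DATE)    AS event_date,  -- partition column, not too granular
       CAST(`ts` AS TIMESTAMP) AS event_time,  -- placeholder timestamp column
       user_id, city, state
FROM dfs.`/path/to/csv`
ORDER BY event_time;                           -- keep rows sorted on the time field

-- A range query of the form above can then prune whole partitions:
SELECT user_id, city, state
FROM dfs.tmp.`events_partitioned`
WHERE event_date BETWEEN DATE '2017-10-01' AND DATE '2017-10-07';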

Hope that helps. 

~ Kunal


Re: looking for dotnet drivers to access the drill

2017-10-30 Thread Saurabh Mahapatra
Hi Ram,

I am curious whether you are trying to access Drill from a .NET application.
Is this a C# or Java app? I am not sure about your specifics, but you can most
probably use the ODBC driver. See this:

https://msdn.microsoft.com/en-us/library/ms228366(v=vs.90).aspx

I do not know enough about what you are doing, so please share any more
information you may have.

Best,
Saurabh


looking for dotnet drivers to access the drill

2017-10-30 Thread Duraisamy, Ram


Regards
Ram
Stanley Black & Decker, Inc.
www.stanleyblackanddecker.com
701 East Joppa Rd., Towson, MD 21286
Business :   410-716-3167 Ext 23167
Mobile:  410-790-0677



Re: Drill performance question

2017-10-30 Thread Charles Givre
The data itself contains 6 or so columns: date, user_id, city, state, lat,
long. I’m looking to aggregate by week, by day of week, etc., so the general
pattern would look something like:

SELECT EXTRACT( day FROM `date` ) AS `day`,
COUNT( DISTINCT id ) AS distinct_id,
COUNT( id ) AS total_id
FROM <view>
GROUP BY EXTRACT( day FROM `date` )
ORDER BY `day` ASC
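
A variant of that pattern, sketched here using only EXTRACT units Drill is
known to support: grouping on year, month and day together avoids folding the
same day number from different months into one bucket:

SELECT EXTRACT( year FROM `date` )  AS `year`,
       EXTRACT( month FROM `date` ) AS `month`,
       EXTRACT( day FROM `date` )   AS `day`,
       COUNT( DISTINCT id )         AS distinct_id,
       COUNT( id )                  AS total_id
FROM <view>
GROUP BY EXTRACT( year FROM `date` ), EXTRACT( month FROM `date` ),
         EXTRACT( day FROM `date` )
ORDER BY `year`, `month`, `day`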

The view I created cast all the fields to the appropriate data types. 
— C


Re: Drill performance question

2017-10-30 Thread Ted Dunning
Also, on a practical note, Parquet will likely crush CSV on performance.
Columnar. Compressed. Binary.  All that.



Re: Drill performance question

2017-10-30 Thread Saurabh Mahapatra
Hi Charles,

Can you share some query patterns on this data? More specifically, the number
of columns you are retrieving out of the total, and the filter on the time
dimension itself (ranges and granularities).

How much is ad hoc and how much is not?

Best,
Saurabh


Drill performance question

2017-10-30 Thread Charles Givre
Hello all, 
I have a dataset consisting of about 16 GB of CSV files. I am looking to do
some time series analysis of this data, and I created a view, but when I
started doing aggregate queries using components of the date, the performance
was disappointing. Would it be better to do a CTAS and partition by components
of the date? If so, would Parquet be the best format?
Would anyone have other suggestions for things I could do to improve performance?
Thanks,
— C

[HANGOUT] Topics for 10/31/2017

2017-10-30 Thread Gautam Parai
Hi,

We will have a Drill hangout tomorrow (Tuesday Oct 31) at 10 AM Pacific Time. 
Please suggest topics by replying to this thread or bring them up during the 
hangout.

Hangout link:  
https://plus.google.com/hangouts/_/event/ci4rdiju8bv04a64efj5fedd0lc

Thanks,
Gautam