Re: Merging files

2016-06-23 Thread Jinfeng Ni
This hash_distribute option should only matter when you have CTAS "partition by". If you do not do partition in CTAS, there should be no impact at all (in theory). Essentially, this option is to re-distribute the data according to the partition key, before Drill writes to target tables. See DRIL

Re: Is this normal view behavior?

2016-06-23 Thread Ted Dunning
On Thu, Jun 23, 2016 at 12:33 PM, John Omernik wrote: > Am I over thinking minutia again? :) > > I htink that this counts as basics, not minutiae.

Correcting columns which can contain multiple types

2016-06-23 Thread Ben Huntley
Hi Drill users, I'm having an issue with an academic dataset that I am attempting to process with Drill 1.6.0. The dataset contains a field which can either be represented as a double, or as a boolean, which is causing queries to fail. To reproduce this error, use this query: select * from hdf

Re: Is this normal view behavior?

2016-06-23 Thread rahul challapalli
I couldn't reproduce the problem as well. Tried with both csv files and parquet files. Can you point us to the commit which you are using? I am curious to know how you ended up seeing "_DEFAULT_COL_TO_READ_" :) - Rahul On Thu, Jun 23, 2016 at 3:15 PM, Jinfeng Ni wrote: > Tried on a commit on 1.

Re: Is this normal view behavior?

2016-06-23 Thread Jinfeng Ni
Tried on a commit on 1.7.0-SNAPSHOT. Looks like I could not re-produce the problem. Which version are u using? create view dfs.tmp.myview as select dir0 as p_day, l_partkey, l_orderkey, l_suppkey from dfs.tmp.t2; +---+-+ | ok |

Re: Drill taking way too long to plan query

2016-06-23 Thread Tanmay Solanki
yes tables/stats/iad/mmdd is the structure and each of those directories is a day so when I run a query on tables/stats/iad/201604* I am running it on the whole month of april 2016. By refreshing metadata for tables/stats/iad, it is trying to do it for all of the data which goes back until j

Re: Drill taking way too long to plan query

2016-06-23 Thread Neeraja Rentachintala
What is the partition/directory structure of your data. Is this by day? On Thu, Jun 23, 2016 at 1:48 PM, Tanmay Solanki wrote: > Yeah, so I tried to cache the metadata for 1 day earlier and saw that that > greatly improved the performance. Then I tried doing this for a full 1 > month of data but

Re: Discussion: Comments in Drill Views

2016-06-23 Thread Ted Dunning
This is very interesting. I love docstrings in Lisp and Python and Javadoc in Java. Basically this is like that, but for SQL. Very helpful. On Thu, Jun 23, 2016 at 11:48 AM, John Omernik wrote: > I am looking for discussion here. A colleague was asking me how to add > comments to the metadata

Question about querying array fields in parquet files

2016-06-23 Thread David Kincaid
I'm very new to Drill and just learning how everything works. I had a question about a query when one of the fields in an array (or list) of values. To simplify, I have a Parquet file of records where each record has just two fields. "name" is a string value and "lastName" is an array of strings. T

Re: Drill taking way too long to plan query

2016-06-23 Thread Tanmay Solanki
Yeah, so I tried to cache the metadata for 1 day earlier and saw that that greatly improved the performance. Then I tried doing this for a full 1 month of data but unfortunately that was not allowed. I had to cache run "refresh table metadata" on the full iad folder since it does it by directory

Re: Drill taking way too long to plan query

2016-06-23 Thread Neeraja Rentachintala
You might want to enable metadata caching and see if it helps. https://drill.apache.org/docs/optimizing-parquet-metadata-reading/ On Thu, Jun 23, 2016 at 1:36 PM, Tanmay Solanki wrote: > Below is the plan. The amount of files is ~213000 files of parquet data. > > 0: jdbc:drill:> explain plan f

Re: Drill taking way too long to plan query

2016-06-23 Thread Tanmay Solanki
Below is the plan. The amount of files is ~213000 files of parquet data. 0: jdbc:drill:> explain plan for select count(*) from s3.`tables/stats/iad/201604*/`; +--+--+ |   text | json | +--+--+ | 00-00    Screen 00-01  Project(EXPR$0=[$0]) 00-02    Project(EXPR$0=[$0]

Re: Is this normal view behavior?

2016-06-23 Thread Neeraja Rentachintala
This is a bug. On Thu, Jun 23, 2016 at 1:32 PM, rahul challapalli < challapallira...@gmail.com> wrote: > This looks like a bug. If you renamed the dir0 column as p_day, then you > should see that in sqlline as well. And I have never seen > "_DEFAULT_COL_TO_READ_" > before. Can you file a jira? >

Re: Is this normal view behavior?

2016-06-23 Thread rahul challapalli
This looks like a bug. If you renamed the dir0 column as p_day, then you should see that in sqlline as well. And I have never seen "_DEFAULT_COL_TO_READ_" before. Can you file a jira? - Rahul On Thu, Jun 23, 2016 at 12:33 PM, John Omernik wrote: > I have a table that is a directory of parquet f

Is this normal view behavior?

2016-06-23 Thread John Omernik
I have a table that is a directory of parquet files, each row had say 3 columns, and the table is split into subdirectories that allow me to use dir0 partitioning. so if I select * from `table` I get col1, col2, col3, and dir0 as my fields returned. So if I create a view CREATE VIEW view_myview

Re: Merging files

2016-06-23 Thread John Omernik
It's basically a two level grouping that has a LEFT JOIN so select a.field1, a.field2, sum(b.somefield) as new_thing from table1 a LEFT JOIN table2 b on a.id = b.id where a.field1 = '2015-05-05' group by a.field1, b.field2 It's not a very complicated query, but it doesn't like the hash_distribut

Discussion: Comments in Drill Views

2016-06-23 Thread John Omernik
I am looking for discussion here. A colleague was asking me how to add comments to the metadata of a view. (He's new to Drill, thus the idea of not having metadata for a table is one he's warming up to). That got me thinking... why couldn't we use Drill Views to store table/field comments? This

Re: Merging files

2016-06-23 Thread Jinfeng Ni
I looked at the code. 1. Drill did log this long CanNotPlan msg in error level. 2) It was replaced with a much shorter version msg only when CanNotPlan was caused by cartesian join. I guess your query probably did not have cartesian join, and CanNotPlan was caused by other reasons. Either way, I t

Re: Drill taking way too long to plan query

2016-06-23 Thread Ted Dunning
Also, how many files? What format? Being so slow is an anomaly. On Thu, Jun 23, 2016 at 11:15 AM, Khurram Faraaz wrote: > Can you please share the query plan for that long running query here ? > > On Thu, Jun 23, 2016 at 11:40 PM, Tanmay Solanki < > tsolank...@yahoo.in.invalid> wrote: > > >

Re: Drill taking way too long to plan query

2016-06-23 Thread Khurram Faraaz
Can you please share the query plan for that long running query here ? On Thu, Jun 23, 2016 at 11:40 PM, Tanmay Solanki < tsolank...@yahoo.in.invalid> wrote: > I am trying to run a query on Apache drill to simply count the number of > rows in a table stored in parquet format in S3. I am running t

Re: Merging files

2016-06-23 Thread John Omernik
Unfortunatly, the 14 mb error message contains to much proprietary information for me to post to a Jira, the query itself may also be a bit to revealing.This is Drill 1.6, so maybe the issue isn't fixed in my version? do you know the original JIRA for the really long error? On Thu, Jun 23, 2016 at

Drill taking way too long to plan query

2016-06-23 Thread Tanmay Solanki
I am trying to run a query on Apache drill to simply count the number of rows in a table stored in parquet format in S3. I am running this on a 20 node r3.8xlarge EC2 instance cluster and I have my direct memory set to 80GB, heap memory set to 32GB and set the planner.memory.max_memory_per_node

Re: Merging files

2016-06-23 Thread Jinfeng Ni
This "CannotPlanException" definitely is a bug in query planner. I thought we had put code to show that extremely long error msg "only" in debug mode. Looks like it's not that case. Could you please open a JIRA and post your query, if possible? thx. On Thu, Jun 23, 2016 at 10:45 AM, John Omernik

Re: Merging files

2016-06-23 Thread Ted Dunning
On Thu, Jun 23, 2016 at 10:41 AM, John Omernik wrote: > First of all, I feel like that as a "perhaps knows just enough to be > dangerous" power/intermediate user, it barely registered that I should, for > optimal performance, do something about 400 files. I saw them and almost > moved on before

Re: Merging files

2016-06-23 Thread John Omernik
Jinfeng - I wrote my item prior to reading yours. Just an FYI, when I ran with that settting, I got a "CannotPlanException" (with an error that is easily the longest "non-verbose"( heck this beats all the verbose errors I've had) I've ever seen. I'd post it here, but I am not unsure if my Google h

Re: Merging files

2016-06-23 Thread John Omernik
This worked perfectly. Thanks Jason. (It also made my small 1.5m per day table into 600K per day... so double win) So, I like this approach and will use it. It makes sense how it works, but obviously this is something I had to come to the Drill User group for, and you, being an expert on Drill in

Re: Merging files

2016-06-23 Thread Jinfeng Ni
Do you partition by day in your CTAS? If that's the case, CTAS will produce at least one parquet file for each value of "day". If you have 100 days, then you will end up at least 100 files. However, in case the query is executed in distributed mode, there could be more than one file per value. In

Re: Merging files

2016-06-23 Thread Jason Altekruse
Apply a sort in your CTAS, this will force the data down to a single stream before writing. Jason Altekruse Software Engineer at Dremio Apache Drill Committer On Thu, Jun 23, 2016 at 10:23 AM, John Omernik wrote: > When have a small query writing smaller data (like aggregate tables for > faster

Merging files

2016-06-23 Thread John Omernik
When have a small query writing smaller data (like aggregate tables for faster aggregates for Dashboards etc). It appears to write a ton of small files. Not sure why, maybe its just how the join worked out etc. I have a "day" that is 1.5M in total size, but 400 files total. This seems excessive.

Re: Issue with join query having same storage plugin name and same datatype

2016-06-23 Thread Magnus Pierre
Please look at the plan generated by Drill: explain plan with implementation for SELECT… (An idea is to test the generated SQL in the plan in the DB to see what it does) One idea is to try to put one of the tables in a sub select table where you actively rename the column. WITH s0 as ( SELECT

Issue with join query having same storage plugin name and same datatype

2016-06-23 Thread SanjiV SwaraJ
I want to join two table using same storage plugin. But One Of the Column showing null value.I am using drill in window 8 system with latest version of drill(i.e;1.6) . I am using this query:- SELECT T2.ID AS T_ID,T1.ID AS T1_ID ,T1.ProviderID AS ProviderID,

Re: Pivot in Apache Drill

2016-06-23 Thread Sanjiv Kumar
I have tables : Sl No Name Status Time Taken 1 Description In Progress 2 2 StockNumber In Progress 3 3 SpecSheet Completed 3 I want to display all the row of NAME column(i.e:- Description. StockNumber, SpecSheet) as Column name in SQL SERVER using Drill. EXAMPLE:- Description St