Re: DRILL-1257

2015-07-30 Thread Adam Gilmore
nough for your > use cases... and could be a good start to this work). > > -- > Jacques Nadeau > CTO and Co-Founder, Dremio > > On Wed, Jul 29, 2015 at 6:44 PM, Adam Gilmore > wrote: > > > Wanted to touch base to see what the status was of DRILL-1257. > > >

Re: Querying partitioned Parquet files

2015-07-29 Thread Adam Gilmore
Just to clarify this, Jason - you don't necessarily need HDFS or the like for this, if you had say a NFS volume (for example, Amazon Elastic File System), you can still accomplish it, right? Or merely if you had all files duplicated on every node locally. On Thu, Jul 30, 2015 at 10:00 AM, Jason A

DRILL-1257

2015-07-29 Thread Adam Gilmore
Wanted to touch base to see what the status was of DRILL-1257. We've run into a few instances where JSON/Mongo data is changing types and Drill is unable to query it (e.g. a numeric type becomes a string type). I know this is a pretty massive change with a lot of tough decisions to make on how to

Re: MapR Drill - mongodb collections does not show up

2015-06-09 Thread Adam Gilmore
nst a cluster of Mongo nodes > holding > 5tb of data using a large number of nodes and threads per node). > > thx, > Jacques > > On Mon, Jun 8, 2015 at 5:06 PM, Adam Gilmore > wrote: > > > Just my input here guys. We experienced the exact same issue due to the > >

Re: MapR Drill - mongodb collections does not show up

2015-06-08 Thread Adam Gilmore
Sorry - on a side note, I forgot to mention that this only occurred when connecting to a replica set in 3.0. Connecting to a single 3.0 instance did not have the problem. On Tue, Jun 9, 2015 at 10:06 AM, Adam Gilmore wrote: > Just my input here guys. We experienced the exact same issue due

Re: MapR Drill - mongodb collections does not show up

2015-06-08 Thread Adam Gilmore
Just my input here guys. We experienced the exact same issue due to the fact that Drill is still using the 2.x Mongo Java driver. Mongo 3.0's server does not play nicely with this driver (you cannot see any collections). If it does turn out that you're using Mongo 3.0, then you need to be using

Custom UDFS slow

2015-05-26 Thread Adam Gilmore
Hi guys, I have written a couple of custom UDFs (specifically WEEK() and WEEKYEAR() to get that date information out of timestamps). I sampled two queries (on approx. 11 million records in Parquet files): select count(*) from `table` group by extract(day from `timestamp`) 750ms select count(*)

Re: Understanding Drill's timestamp and timezone

2015-05-11 Thread Adam Gilmore
I must say - this is really confusing and seems to be undocumented. I think if Drill is going to not support a timestamp with timezone in the near future, it should deal with ALL date/times as UTC, or at the very least provide functions to convert between the two where applicable. For example, th

Re: Query planning cost

2015-05-07 Thread Adam Gilmore
Yep - it's a tad confusing. As Jacques said, it's definitely running the scans in parallel, but it does seem pretty much linear. On Fri, May 8, 2015 at 10:44 AM, Ted Dunning wrote: > On Fri, May 8, 2015 at 12:30 AM, Adam Gilmore > wrote: > > > We're getting ab

Re: Query planning cost

2015-05-07 Thread Adam Gilmore
eau wrote: > We log for Parquet footer reading and block Map building. What are the > reported times for each in your scenario? Are you on HDFS or MFS? > > Thx > On May 7, 2015 10:47 AM, "Adam Gilmore" wrote: > > > Hey sorry my mistake - you're right. Didn'

Re: Query planning cost

2015-05-07 Thread Adam Gilmore
I'll double check the debug logs. We're getting about a 350ms delay for 70 files, about 200ms for 35 files, about 20-30ms for 1 file. We're using HDFS. It doesn't appear that it's just saturating HDFS with reads, either. Regards, *Adam Gilmore* Director of Technolog

Re: Query planning cost

2015-05-07 Thread Adam Gilmore
a: > > https://issues.apache.org/jira/browse/DRILL-2743 > > Note, I also think Steven has identified some places where we re-get > FileStatus multiple times which can also lead to poorer start performance. > I'm not sure there is an issue open against this but we should get one >

Re: Query planning cost

2015-05-06 Thread Adam Gilmore
, larger files, but still want the benefit of smaller row groups (as I have just done the Parquet pushdown filtering). On Thu, May 7, 2015 at 4:08 PM, Adam Gilmore wrote: > Hi guys, > > I've been looking at the speed of some of our queries and have noticed > there is quite a sig

Query planning cost

2015-05-06 Thread Adam Gilmore
Hi guys, I've been looking at the speed of some of our queries and have noticed there is quite a significant delay to the query actually starting. For example, querying about 70 Parquet files in a directory, it takes about 370ms before it starts the first fragment. Obviously, considering it's no

Mongo query speed

2015-05-05 Thread Adam Gilmore
Hi guys, I know there was recently a patch around Mongo slowness with regards to a bug in the reader; however, the querying is still fairly slow when compared to Mongo's aggregation framework itself (in our tests 5-10 times slower). My guess is this is due to the fact we serialize BSON to JSON an

Parquet pushdown filtering and cost

2015-04-27 Thread Adam Gilmore
Hi guys, I have started implementing a Parquet pushdown filtering optimizer rule and have made significant progress. Using some of the Mongo pushdown filtering code, I was able to quickly convert logical expressions into proper Parquet filter2 API expressions. The issue is, because the "old" (or

New Drillbits joining cluster causes severe performance spike

2015-04-21 Thread Adam Gilmore
Hey guys, I'm troubleshooting some issues with our cluster under some production load and scaling. If we add new drillbits to a cluster, as soon as they join the cluster, performance degrades severely (queries that usually take 1s would take 60s, for example). After a few minutes, it recovers jus

Query plan changes based on field names

2015-04-17 Thread Adam Gilmore
Hi guys, I raised: https://issues.apache.org/jira/browse/DRILL-2732 I'd be keen to get stuck in and implement a patch for this, but I was hoping someone might be able to point me in the right direction. The behaviour seems extremely strange that a field name could affect a query plan. Ultimate

Re: Drill favouring a particular Drillbit

2015-04-16 Thread Adam Gilmore
mentation probably > wouldn't be that difficult (assuming you keep it node-level as opposed to > cluster level). We merged the auto shuffling per session so let us know > how that looks. > > On Wed, Apr 15, 2015 at 4:35 PM, Adam Gilmore > wrote: > > > The workload does invo

Re: Drill favouring a particular Drillbit

2015-04-15 Thread Adam Gilmore
s (the last merge > is on the foreman node), work should be reasonably distributed. > > On Sun, Apr 12, 2015 at 10:29 PM, Adam Gilmore > wrote: > > > Looks like this definitely is the following bug: > > > > https://issues.apache.org/jira/browse/DRILL-2512 > > >

Re: Drill favouring a particular Drillbit

2015-04-12 Thread Adam Gilmore
Means it's nearly impossible for us to scale out. On Wed, Apr 8, 2015 at 3:58 PM, Adam Gilmore wrote: > Anyone have any more thoughts on this? Anywhere I can start trying to > troubleshoot? > > On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore > wrote: > >> So there are

Re: Counting large numbers of unique values

2015-04-08 Thread Adam Gilmore
Ted - I'd be really interested in doing something like that (approximate aggregation results). This would be very interesting in terms of standard deviation, median, etc. I know there is another project out there that trades off speed vs accuracy (the name of which escapes me). If we could easil

Re: Drill favouring a particular Drillbit

2015-04-07 Thread Adam Gilmore
Anyone have any more thoughts on this? Anywhere I can start trying to troubleshoot? On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore wrote: > So there are 5 Parquet files, each ~125mb - not sure what I can provide re > the block locations? I believe it's under the HDFS block s

Re: Query performance and clustering

2015-03-25 Thread Adam Gilmore
; You also may want to experiment with planner.width.max_per_query. > > I have not looked into the queue mechanisms in detail yet, but it > doesn’t seem that the cluster is having issues with how it is managing > concurrency. > > > > Keep in mind AWS can be inconsistent in

Re: Query performance and clustering

2015-03-25 Thread Adam Gilmore
Do I need debug logging for this or? Regards, *Adam Gilmore* Director of Technology a...@pharmadata.net.au +61 421 997 655 (Mobile) 1300 733 876 (AU) +617 3171 9902 (Intl) *PharmaData* Data Intelligence Solutions for Pharmacy www.PharmaData.net.au <http://www.pharmadata.net.au/>

Re: Drill favouring a particular Drillbit

2015-03-25 Thread Adam Gilmore
was submitted to. But I'm still not sure it's related > to DRILL-2512. > > I'll wait for your additional info before speculating further. > > On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore > wrote: > > > We actually setup a separate load balancer for port 80

Re: Drill favouring a particular Drillbit

2015-03-25 Thread Adam Gilmore
when looking at the query profiles? Is the node that is > being > > > > hammered the foreman for the queries and most of the major fragments > > are > > > > tied to the foreman? > > > > > > > > —Andries > > > > >

Query performance and clustering

2015-03-25 Thread Adam Gilmore
Hi all, I'm doing some testing on query performance, especially in a clustered environment. The test data is 5 Parquet files with 2.2 million records in each file (total of ~11m). The cluster is an Amazon EMR cluster with a total of 10 drillbits (c3.xlarge instances). A single SUM() with a GROU

Drill favouring a particular Drillbit

2015-03-25 Thread Adam Gilmore
Hi guys, I'm trying to understand how this could be possible. I have a Hadoop cluster of a name node and two data nodes set up. All have identical specs in terms of CPU/RAM etc. The two data nodes have a replicated HDFS setup where I'm storing some Parquet files. A Drill cluster (with Zookeeper

Re: Storage Plugin Config for XML

2015-03-01 Thread Adam Gilmore
I would imagine you'd have to read all XML as a string unless an XSD was provided, which would allow you to infer the types. Still be easy enough to cast to the types you need, similar to JSON in the all text mode. On Wed, Feb 25, 2015 at 5:41 PM, Ted Dunning wrote: > > To help with this, I jus

Re: Storage Plugin Config for XML

2015-03-01 Thread Adam Gilmore
I would imagine you'd have to read all XML as a string unless an XSD was provided, which would allow you to infer the types. Still be easy enough to cast to the types you need, similar to JSON in the all text mode.

Re: Using CTAS with nested structures

2015-02-23 Thread Adam Gilmore
I submitted a patch for reading all JSON numbers as doubles; however, it'd probably be nice to extend that to specify a default to read as anything. Something like ... alter session set `store.json.read_numbers_as` = 'DECIMAL(5,2)'; would be useful. On Tue, Feb 24, 2015 at 9:32 AM, Steven Phill

Memory error

2015-02-19 Thread Adam Gilmore
Hi guys, I've just suddenly started getting memory errors: 0: jdbc:drill:zk=local> create table dfs.tmp.purchases4 as (select price, quantity from mongo.`connect`.`events`); ++---+ | Fragment | Number of records written | ++---

Re: Directory partitions slower than scan all events

2015-02-18 Thread Adam Gilmore
projection (e.g. select/group/etc.) then it is not included in the scan. On Thu, Feb 19, 2015 at 4:09 PM, Adam Gilmore wrote: > Hi guys, > > I'm trying to understand something about directory partitions and how > they're implemented. > > For sake of basic argument, I ha

Directory partitions slower than scan all events

2015-02-18 Thread Adam Gilmore
Hi guys, I'm trying to understand something about directory partitions and how they're implemented. For sake of basic argument, I have ~3 mil rows in 3 separate Parquet files. Each one has a "groupId" of 1, 2 and 3 respectively. I then place them in separate directories named 1, 2 and 3. The f

MongoDB provider

2015-02-16 Thread Adam Gilmore
Hi guys, I was having a look at the MongoDB provider and was wondering if this is normal. If I issue a basic: select * from mongo.db.collection; I end up with a single column coming back (named *) with the entire document in it. Of course, if I select individual fields, then this works fine,

Complex Parquet reading

2015-01-15 Thread Adam Gilmore
Hi all, I have a question on the complex Parquet reader. I note in code that if anything in the Parquet file is non-primitive, it falls back to the other Parquet reader. I've also noted, when this happens, no matter how many drillbits I have, it doesn't seem to parallelize the processing - or a

Re: Parquet and filtering

2015-01-11 Thread Adam Gilmore
ely. I can certainly meet in a hangout or > just answer questions via e-mail if you needed help navigating the current > code. > > -Jason > > On Thu, Jan 8, 2015 at 10:27 PM, Adam Gilmore > wrote: > > > What about starting with something simple? > > > > For example

Re: Parquet and filtering

2015-01-08 Thread Adam Gilmore
group/page. What do you think?

Re: Parquet and filtering

2015-01-08 Thread Adam Gilmore
r performance in the > case of full table scans on nested/repeated data. > > - Jason > > On Thu, Jan 8, 2015 at 7:45 AM, Jacques Nadeau wrote: > > > That is correct. > > > > On Wed, Jan 7, 2015 at 7:57 PM, Adam Gilmore > > wrote: > > > > > T

Re: Filter by objectId field in Mongo

2015-01-08 Thread Adam Gilmore
ustomerId.`$oid` = > '54901607f10c2236769f7b3b' limit 1; > > as well as: > > select customerId from mongo.`connect`.events e where e.customerId.`$oid` = > '54901607f10c2236769f7b3b' limit 1; > > > > On Thu, Jan 8, 2015 at 5:26 PM, A

Re: Filter by objectId field in Mongo

2015-01-08 Thread Adam Gilmore
stomerId.`$oid` = > '54901607f10c2236769f7b3b' limit 1; > > Thanks, > Jacques > > On Wed, Jan 7, 2015 at 12:44 AM, Adam Gilmore > wrote: > > > Unfortunately, that didn't work. I tried: > > > > select * from mongo.`connect`.events w

Re: Filter by objectId field in Mongo

2015-01-08 Thread Adam Gilmore
| ++ 1 row selected (0.261 seconds) Strange results there - I played around with the second query and it seems to be able to return anything but * nicely. So you're probably right. Regards, *Adam Gilmore* Director of Technology a...@pharmadata.net.au +61 421 997 655 (M

Re: Can't query parquet on HDFS

2015-01-07 Thread Adam Gilmore
large? > > thanks, > Jacques > > On Tue, Jan 6, 2015 at 9:29 PM, Adam Gilmore > wrote: > > > Anyone got any ideas on this one? I can consistently reproduce the issue > > with HDFS - the minute I get the data off HDFS (to a local drive), it all > > works fine. >

Re: Parquet and filtering

2015-01-07 Thread Adam Gilmore
n Thu, Jan 8, 2015 at 1:57 PM, Adam Gilmore wrote: > That makes a lot of sense. Just one question with regard to handling > complex types - do you mean maps/arrays/etc. (repetitions in Parquet)? As > in, if I created a Parquet table from some JSON files with a rather > complex/nes

Re: Parquet and filtering

2015-01-07 Thread Adam Gilmore
, *Adam Gilmore* Director of Technology a...@pharmadata.net.au +61 421 997 655 (Mobile) 1300 733 876 (AU) +617 3171 9902 (Intl) *PharmaData* Data Intelligence Solutions for Pharmacy www.PharmaData.net.au <http://www.pharmadata.net.au/> [image: pharmadata-sig] *Disclaimer*

Re: Parquet and filtering

2015-01-07 Thread Adam Gilmore
ust made one, I put some comments there from the design discussions we > have had in the past. > > https://issues.apache.org/jira/browse/DRILL-1950 > > - Jason Altekruse > > On Tue, Jan 6, 2015 at 11:04 PM, Adam Gilmore > wrote: > > > Just a quick follow up on this - i

Re: Can't query parquet on HDFS

2015-01-07 Thread Adam Gilmore
-d185-4101-b466-1dd231808a9d on ip-10-8-1-70.ap-southeast-2.compute.internal:31010 ] [ 91b9e166-d185-4101-b466-1dd231808a9d on ip-10-8-1-70.ap-southeast-2.compute.internal:31010 ] See: https://www.dropbox.com/s/akdyfxb98q5adxg/saletest3.tgz?dl=0 On Wed, Jan 7, 2015 at 6:42 PM, Adam Gilmore wrote

Re: Mongo performance very slow

2015-01-07 Thread Adam Gilmore
ld return very > quickly. Will also be interesting to compare query plans in Drill. > > —Andries > > > On Jan 6, 2015, at 4:08 PM, Adam Gilmore wrote: > > > Hi all, > > > > I'm trying to test out Mongo with Drill but seem to be running into very > &

Re: Filter by objectId field in Mongo

2015-01-07 Thread Adam Gilmore
> > Note that currently Drill uses SQL expressions with dotted notation > extensions for filters and doesn't currently support the mongodb based json > object filters. > > On Tue, Jan 6, 2015 at 3:52 PM, Adam Gilmore > wrote: > > > Hi Kamesh, > > > > U

Re: Can't query parquet on HDFS

2015-01-07 Thread Adam Gilmore
ing? Is the file something you can share > privately or publically or is too large? > > thanks, > Jacques > > On Tue, Jan 6, 2015 at 9:29 PM, Adam Gilmore > wrote: > > > Anyone got any ideas on this one? I can consistently reproduce the issue > > with HDFS - the m

Re: Can't query parquet on HDFS

2015-01-07 Thread Adam Gilmore
et me know how I can assist in reproducing.

Re: Parquet and filtering

2015-01-06 Thread Adam Gilmore
ads. > > -Jason > > > > On Mon, Jan 5, 2015 at 8:15 AM, Adam Gilmore > wrote: > > > Hi guys, > > > > I have a question re Parquet. I'm not sure if this is a Drill question > or > > Parquet, but thought I'd start here. > > >

Re: Can't query parquet on HDFS

2015-01-06 Thread Adam Gilmore
Anyone got any ideas on this one? I can consistently reproduce the issue with HDFS - the minute I get the data off HDFS (to a local drive), it all works fine. Doesn't seem to be a problem with Parquet - more like the HDFS storage engine. On Tue, Jan 6, 2015 at 9:50 AM, Adam Gilmore

Mongo performance very slow

2015-01-06 Thread Adam Gilmore
Hi all, I'm trying to test out Mongo with Drill but seem to be running into very slow performance. I have about 1M documents loaded into Mongo, and I'm doing something as simple as: select count(*) from mongo.`connect`.events group by collection; where "collection" is a string field in the docu

Re: Filter by objectId field in Mongo

2015-01-06 Thread Adam Gilmore
Hi Kamesh, Unfortunately it's not on _id - it's on another objectId field we have in our documents. That seems to work fine with _id but with anything else, it returns no results. Any thoughts?

Re: Can't query parquet on HDFS

2015-01-05 Thread Adam Gilmore
The data is okay, because the exact same Parquet directory is working fine on the local drive, it's just not working when using HDFS. I tried casting as you said, but that ended up with the exact same problem.

Re: Parquet and filtering

2015-01-05 Thread Adam Gilmore
Hi Jason, Understood - so currently Drill doesn't do predicate pushdown for Parquet?

Re: Can't query parquet on HDFS

2015-01-05 Thread Adam Gilmore
ata type. Also please verify that all the > column data is satisfying your data type. > > Sudhakar Thota > Sent from my iPhone > > > On Jan 5, 2015, at 5:56 AM, Adam Gilmore wrote: > > > > The actual stack trace is: > > > > 2015-01-05 13:48:27,356 [2b55

Re: Parquet and filtering

2015-01-05 Thread Adam Gilmore
alue copies as we filtered out > the records that were not needed. This currently takes place in a separate > filter operator and should be pushed down into the read operation to make > use of the file meta-data and eliminate some of the reads. > > -Jason > > > > On Mon,

Parquet and filtering

2015-01-05 Thread Adam Gilmore
Hi guys, I have a question re Parquet. I'm not sure if this is a Drill question or Parquet, but thought I'd start here. I have a sample dataset of ~100M rows in a Parquet file. It's quick to sum a single column across the whole dataset. I have a column which has approx 100 unique values (e.g.

Re: Can't query parquet on HDFS

2015-01-05 Thread Adam Gilmore
et-format-2.1.1-drill-r1.jar:na] at parquet.format.Util.read(Util.java:47) ~[parquet-format-2.1.1-drill-r1.jar:na] ... 21 common frames omitted On Mon, Jan 5, 2015 at 6:26 PM, Adam Gilmore wrote: > Hi all, > > I'm trying to do a really simple query on a parquet dire

Can't query parquet on HDFS

2015-01-05 Thread Adam Gilmore
Hi all, I'm trying to do a really simple query on a parquet directory on HDFS. This works fine: select count(*) from hdfs.warehouse.saleparquet However, this fails: 0: jdbc:drill:local> select sum(sellprice) from hdfs.warehouse.saleparquet; Query failed: Query failed: Failure while running fra

Filter by objectId field in Mongo

2015-01-05 Thread Adam Gilmore
Hi all, I'm trying to work out how to filter by an objectId field using the Mongo plugin. I've tried many combinations of = '{$oid: ''id''}' etc. but nothing seems to work. Is this implemented yet? If not, is there a JIRA item for it already?