nough for your
> use cases... and could be a good start to this work).
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Wed, Jul 29, 2015 at 6:44 PM, Adam Gilmore
> wrote:
>
> > Wanted to touch base to see what the status was of DRILL-1257.
> >
>
Just to clarify this, Jason - you don't necessarily need HDFS or the like
for this; if you had, say, an NFS volume (for example, Amazon Elastic File
System), you could still accomplish it, right? Or even if you simply had all
files duplicated locally on every node.
On Thu, Jul 30, 2015 at 10:00 AM, Jason A
Wanted to touch base to see what the status was of DRILL-1257.
We've run into a few instances where JSON/Mongo data is changing types and
Drill is unable to query it (e.g. a numeric type becomes a string type).
I know this is a pretty massive change with a lot of tough decisions to
make on how to
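As a hypothetical illustration (not Drill's actual behaviour), the kind of fallback being discussed - treating a column as text once its value types conflict, in the spirit of Drill's `store.json.all_text_mode` - might look like:

```python
def coerce_column(values):
    """If a column's values have mixed types (e.g. a number becomes a
    string mid-stream), fall back to reading everything as text.
    Illustrative sketch only, not Drill's schema-change handling."""
    if len({type(v) for v in values}) > 1:
        return [str(v) for v in values]
    return values
```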
nst a cluster of Mongo nodes
> holding > 5tb of data using a large number of nodes and threads per node).
>
> thx,
> Jacques
>
> On Mon, Jun 8, 2015 at 5:06 PM, Adam Gilmore
> wrote:
>
> > Just my input here guys. We experienced the exact same issue due to the
Sorry - on a side note, I forgot to mention that this only occurred when
connecting to a replica set in 3.0. Connecting to a single 3.0 instance
did not have the problem.
On Tue, Jun 9, 2015 at 10:06 AM, Adam Gilmore wrote:
> Just my input here guys. We experienced the exact same issue due
Just my input here guys. We experienced the exact same issue due to the
fact that Drill is still using the 2.x Mongo Java driver. Mongo 3.0's
server does not play nicely with this driver (you cannot see any
collections).
If it does turn out that you're using Mongo 3.0, then you need to be using
Hi guys,
I have written a couple of custom UDFS (specifically WEEK() and WEEKYEAR()
to get that date information out of timestamps).
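Assuming these UDFs follow ISO-8601 week semantics (an assumption - the post doesn't say), their logic amounts to:

```python
from datetime import date

def week(d: date) -> int:
    # ISO-8601 week number (1-53); assumed semantics for the WEEK() UDF
    return d.isocalendar()[1]

def weekyear(d: date) -> int:
    # ISO week-numbering year; can differ from the calendar year near Jan 1
    return d.isocalendar()[0]
```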
I sampled two queries (on approx. 11 million records in Parquet files)
select count(*) from `table` group by extract(day from `timestamp`)
750ms
select count(*)
I must say - this is really confusing and seems to be undocumented.
I think if Drill is going to not support a timestamp with timezone in the
near future, it should deal with ALL date/times as UTC, or at the very
least provide functions to convert between the two where applicable.
For example, th
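The sort of explicit conversion helper being asked for can be sketched like this (the function name and signature are hypothetical; per the post, Drill provides no such function):

```python
from datetime import datetime, timezone, timedelta

def to_utc(naive_local: datetime, utc_offset_hours: float) -> datetime:
    """Interpret a naive timestamp as local time at a fixed UTC offset
    and return the equivalent UTC instant. Hypothetical helper
    illustrating the local-time/UTC conversion discussed above."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return naive_local.replace(tzinfo=tz).astimezone(timezone.utc)
```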
Yep - it's a tad confusing. As Jacques said, it's definitely running the
scans in parallel, but it does seem pretty much linear.
On Fri, May 8, 2015 at 10:44 AM, Ted Dunning wrote:
> On Fri, May 8, 2015 at 12:30 AM, Adam Gilmore
> wrote:
>
> > We're getting ab
eau wrote:
> We log for Parquet footer reading and block Map building. What are the
> reported times for each in your scenario? Are you on HDFS or MFS?
>
> Thx
> On May 7, 2015 10:47 AM, "Adam Gilmore" wrote:
>
> > Hey sorry my mistake - you're right. Didn
I'll double check the debug logs.
We're getting about a 350ms delay for 70 files, about 200ms for 35 files,
about 20-30ms for 1 file.
We're using HDFS.
It doesn't appear that it's just saturating HDFS with reads, either.
Regards,
*Adam Gilmore*
Director of Technology
a:
>
> https://issues.apache.org/jira/browse/DRILL-2743
>
> Note, I also think Steven has identified some places where we re-get
> FileStatus multiple times which can also lead to poorer start performance.
> I'm not sure there is an issue open against this but we should get one
>
, larger files, but still want the benefit of
smaller row groups (as I have just done the Parquet pushdown filtering).
On Thu, May 7, 2015 at 4:08 PM, Adam Gilmore wrote:
> Hi guys,
>
> I've been looking at the speed of some of our queries and have noticed
> there is quite a sig
Hi guys,
I've been looking at the speed of some of our queries and have noticed
there is quite a significant delay to the query actually starting.
For example, querying about 70 Parquet files in a directory, it takes about
370ms before it starts the first fragment.
Obviously, considering it's no
Hi guys,
I know there was recently a patch around Mongo slowness with regards to a
bug in the reader; however, the querying is still fairly slow when compared
to Mongo's aggregation framework itself (in our tests 5-10 times slower).
My guess is this is due to the fact we serialize BSON to JSON an
Hi guys,
I have started implementing a Parquet pushdown filtering optimizer rule and
have made significant progress. Using some of the Mongo pushdown filtering
code, I was able to quickly convert logical expressions into proper Parquet
filter2 API expressions.
The issue is, because the "old" (or
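The conversion step described above - turning a logical comparison into a pushdown predicate - can be sketched generically (an illustration of the shape of the translation, not the actual parquet filter2 API):

```python
import operator

# Map SQL-style comparison operators to Python functions
OPS = {'=': operator.eq, '<': operator.lt, '>': operator.gt,
       '<=': operator.le, '>=': operator.ge}

def to_predicate(column, op, literal):
    """Convert a (column, operator, literal) logical expression into a
    row-level predicate closure - the same kind of translation a
    pushdown rule performs when building filter expressions."""
    fn = OPS[op]
    return lambda row: fn(row[column], literal)
```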
Hey guys,
I'm troubleshooting some issues with our cluster under some production load
and scaling.
If we add new drillbits to a cluster, as soon as it joins the cluster,
performance degrades severely (queries that usually take 1s would take 60s,
for example). After a few minutes, it recovers jus
Hi guys,
I raised:
https://issues.apache.org/jira/browse/DRILL-2732
I'd be keen to get stuck in and implement a patch for this, but I was
hoping someone might be able to point me in the right direction.
The behaviour seems extremely strange that a field name could affect a
query plan. Ultimate
mentation probably
> wouldn't be that difficult (assuming you keep it node-level as opposed to
> cluster level). We merged the auto shuffling per session so let us know
> how that looks.
>
> On Wed, Apr 15, 2015 at 4:35 PM, Adam Gilmore
> wrote:
>
> > The workload does invo
s (the last merge
> is on the foreman node), work should be reasonably distributed.
>
> On Sun, Apr 12, 2015 at 10:29 PM, Adam Gilmore
> wrote:
>
> > Looks like this definitely is the following bug:
> >
> > https://issues.apache.org/jira/browse/DRILL-2512
> >
Means it's nearly
impossible for us to scale out.
On Wed, Apr 8, 2015 at 3:58 PM, Adam Gilmore wrote:
> Anyone have any more thoughts on this? Anywhere I can start trying to
> troubleshoot?
>
> On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore
> wrote:
>
>> So there are
Ted - I'd be really interested in doing something like that (approximate
aggregation results). This would be very interesting in terms of standard
deviation, median, etc.
I know there is another project out there that trades off speed vs accuracy
(the name of which escapes me). If we could easil
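One simple way to trade accuracy for speed, sketched here purely as an illustration (this is not the project alluded to above), is to compute statistics like the median over a fixed-size uniform sample of the stream:

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Keep a uniform random sample of k items from a stream of unknown
    length (reservoir sampling). Aggregates such as median or stddev
    computed on the sample are approximate but cheap."""
    rng = random.Random(seed)
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = x
    return sample
```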
Anyone have any more thoughts on this? Anywhere I can start trying to
troubleshoot?
On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore wrote:
> So there are 5 Parquet files, each ~125mb - not sure what I can provide re
> the block locations? I believe it's under the HDFS block s
; You also may want to experiment with planner.width.max_per_query.
> > I have not looked into the queue mechanisms in detail yet, but it
> doesn’t seem that the cluster is having issues with how it is managing
> concurrency.
> >
> > Keep in mind AWS can be inconsistent in
Do I need debug logging for this or?
Regards,
*Adam Gilmore*
Director of Technology
a...@pharmadata.net.au
+61 421 997 655 (Mobile)
1300 733 876 (AU)
+617 3171 9902 (Intl)
*PharmaData*
Data Intelligence Solutions for Pharmacy
www.PharmaData.net.au <http://www.pharmadata.net.au/>
was submitted to. But I'm still not sure it's related
> to DRILL-2512.
>
> I'll wait for your additional info before speculating further.
>
> On Wed, Mar 25, 2015 at 6:54 PM, Adam Gilmore
> wrote:
>
> > We actually setup a separate load balancer for port 80
when looking at the query profiles? Is the node that is
> being
> > > > hammered the foreman for the queries and most of the major fragments
> > are
> > > > tied to the foreman?
> > > >
> > > > —Andries
> > > >
>
Hi all,
I'm doing some testing on query performance, especially in a clustered
environment.
The test data is 5 Parquet files with 2.2 million records in each file
(total of ~11m).
The cluster is an Amazon EMR cluster with a total of 10 drillbits
(c3.xlarge instances).
A single SUM() with a GROU
Hi guys,
I'm trying to understand how this could be possible. I have a Hadoop
cluster set up with a name node and two data nodes. All have identical
specs in terms of CPU/RAM etc.
The two data nodes have a replicated HDFS setup where I'm storing some
Parquet files.
A Drill cluster (with Zookeeper
I would imagine you'd have to read all XML as a string unless an XSD was
provided, which would allow you to infer the types. Still be easy enough
to cast to the types you need, similar to JSON in the all text mode.
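The approach described - read everything as text, then cast explicitly - looks like this in miniature (illustrative only, not a Drill XML reader):

```python
import xml.etree.ElementTree as ET

doc = "<sale><price>9.95</price><qty>3</qty></sale>"
root = ET.fromstring(doc)

# With no XSD, every leaf is a string; the consumer casts explicitly,
# just as with Drill's JSON all-text mode
price = float(root.findtext("price"))
qty = int(root.findtext("qty"))
```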
On Wed, Feb 25, 2015 at 5:41 PM, Ted Dunning wrote:
>
> To help with this, I jus
I submitted a patch for reading all JSON numbers as doubles; however, it'd
probably be nice to extend that to specify a default to read as anything.
Something like ...
alter session set `store.json.read_numbers_as` = 'DECIMAL(5,2)';
would be useful.
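For comparison, Python's json module already exposes this kind of per-parse default via hooks; a sketch of reading all JSON numbers as decimals, analogous to the proposed session option:

```python
import json
from decimal import Decimal

# parse_float/parse_int let the caller choose the numeric type up front,
# analogous to a store.json.read_numbers_as session setting
data = json.loads('{"price": 1.10, "qty": 2}',
                  parse_float=Decimal, parse_int=Decimal)
```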
On Tue, Feb 24, 2015 at 9:32 AM, Steven Phill
Hi guys,
I've just suddenly started getting memory errors:
0: jdbc:drill:zk=local> create table dfs.tmp.purchases4 as (select price,
quantity from mongo.`connect`.`events`);
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
projection (e.g. select/group/etc.) then it is not
included in the scan.
On Thu, Feb 19, 2015 at 4:09 PM, Adam Gilmore wrote:
> Hi guys,
>
> I'm trying to understand something about directory partitions and how
> they're implemented.
>
> For sake of basic argument, I ha
Hi guys,
I'm trying to understand something about directory partitions and how
they're implemented.
For sake of basic argument, I have ~3 mil rows in 3 separate Parquet
files. Each one has a "groupId" of 1, 2 and 3 respectively.
I then place them in separate directories named 1, 2 and 3.
The f
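Conceptually, directory-based partitioning lets the planner skip files whose directory name cannot match the filter; a minimal sketch of that pruning (hypothetical paths, not Drill's planner code):

```python
from pathlib import Path

def prune(base: Path, wanted_group: str):
    """Return only the Parquet files under the directory whose name
    matches the partition value, skipping every other directory."""
    return [f for d in base.iterdir()
            if d.is_dir() and d.name == wanted_group
            for f in d.glob("*.parquet")]
```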
Hi guys,
I was having a look at the MongoDB provider and was wondering if this is
normal.
If I issue a basic:
select * from mongo.db.collection;
I end up with a single column coming back (named *) with the entire
document in it.
Of course, if I select individual fields, then this works fine,
Hi all,
I have a question on the complex Parquet reader.
I note in the code that if anything in the Parquet file is non-primitive, it
falls back to the other Parquet reader.
I've also noted that, when this happens, no matter how many drillbits I have,
it doesn't seem to parallelize the processing - or a
ely. I can certainly meet in a hangout or
> just answer questions via e-mail if you needed help navigating the current
> code.
>
> -Jason
>
> On Thu, Jan 8, 2015 at 10:27 PM, Adam Gilmore
> wrote:
>
> > What about starting with something simple?
> >
> > For example
group/page.
What do you think?
r performance in the
> case of full table scans on nested/repeated data.
>
> - Jason
>
> On Thu, Jan 8, 2015 at 7:45 AM, Jacques Nadeau wrote:
>
> > That is correct.
> >
> > On Wed, Jan 7, 2015 at 7:57 PM, Adam Gilmore
> > wrote:
> >
> > > T
ustomerId.`$oid` =
> '54901607f10c2236769f7b3b' limit 1;
>
> as well as:
>
> select customerId from mongo.`connect`.events e where e.customerId.`$oid` =
> '54901607f10c2236769f7b3b' limit 1;
>
>
>
> On Thu, Jan 8, 2015 at 5:26 PM, A
stomerId.`$oid` =
> '54901607f10c2236769f7b3b' limit 1;
>
> Thanks,
> Jacques
>
> On Wed, Jan 7, 2015 at 12:44 AM, Adam Gilmore
> wrote:
>
> > Unfortunately, that didn't work. I tried:
> >
> > select * from mongo.`connect`.events w
1 row selected (0.261 seconds)
Strange results there - I played around with the second query and it seems
to be able to return anything but * nicely. So you're probably right.
large?
>
> thanks,
> Jacques
>
> On Tue, Jan 6, 2015 at 9:29 PM, Adam Gilmore
> wrote:
>
> > Anyone got any ideas on this one? I can consistently reproduce the issue
> > with HDFS - the minute I get the data off HDFS (to a local drive), it all
> > works fine.
>
n Thu, Jan 8, 2015 at 1:57 PM, Adam Gilmore wrote:
> That makes a lot of sense. Just one question with regarding to handling
> complex types - do you mean maps/arrays/etc. (repetitions in Parquet)? As
> in, if I created a Parquet table from some JSON files with a rather
> complex/nes
ust made one, I put some comments there from the design discussions we
> have had in the past.
>
> https://issues.apache.org/jira/browse/DRILL-1950
>
> - Jason Altekruse
>
> On Tue, Jan 6, 2015 at 11:04 PM, Adam Gilmore
> wrote:
>
> > Just a quick follow up on this - i
-d185-4101-b466-1dd231808a9d on
ip-10-8-1-70.ap-southeast-2.compute.internal:31010 ]
[ 91b9e166-d185-4101-b466-1dd231808a9d on
ip-10-8-1-70.ap-southeast-2.compute.internal:31010 ]
See:
https://www.dropbox.com/s/akdyfxb98q5adxg/saletest3.tgz?dl=0
On Wed, Jan 7, 2015 at 6:42 PM, Adam Gilmore wrote
ld return very
> quickly. Will also be interesting to compare query plans in Drill.
>
> —Andries
>
>
> On Jan 6, 2015, at 4:08 PM, Adam Gilmore wrote:
>
> > Hi all,
> >
> > I'm trying to test out Mongo with Drill but seem to be running into very
> &
>
> Note that currently Drill uses SQL expressions with dotted notation
> extensions for filters and doesn't currently support the mongodb based json
> object filters.
>
> On Tue, Jan 6, 2015 at 3:52 PM, Adam Gilmore
> wrote:
>
> > Hi Kamesh,
> >
> > U
ing? Is the file something you can share
> privately or publically or is too large?
>
> thanks,
> Jacques
>
> On Tue, Jan 6, 2015 at 9:29 PM, Adam Gilmore
> wrote:
>
> > Anyone got any ideas on this one? I can consistently reproduce the issue
> > with HDFS - the m
et me know how I can assist in reproducing.
ads.
>
> -Jason
>
>
>
> On Mon, Jan 5, 2015 at 8:15 AM, Adam Gilmore
> wrote:
>
> > Hi guys,
> >
> > I have a question re Parquet. I'm not sure if this is a Drill question
> or
> > Parquet, but thought I'd start here.
> >
>
Anyone got any ideas on this one? I can consistently reproduce the issue
with HDFS - the minute I get the data off HDFS (to a local drive), it all
works fine.
Doesn't seem to be a problem with Parquet - more like the HDFS storage
engine.
On Tue, Jan 6, 2015 at 9:50 AM, Adam Gilmore
Hi all,
I'm trying to test out Mongo with Drill but seem to be running into very
slow performance.
I have about 1M documents loaded into Mongo, and I'm doing something as
simple as:
select count(*) from mongo.`connect`.events group by collection;
where "collection" is a string field in the docu
Hi Kamesh,
Unfortunately it's not on _id - it's on another objectId field we have in
our documents. That seems to work fine with _id but with anything else, it
returns no results.
Any thoughts?
The data is okay, because the exact same Parquet directory works fine on
the local drive; it just doesn't work when using HDFS. I tried casting as
you suggested, but that ended up with the exact same problem.
Hi Jason,
Understood - so currently Drill doesn't do predicate pushdown for Parquet?
ata type. Also please verify that all the
> column data is satisfying your data type.
>
> Sudhakar Thota
> Sent from my iPhone
>
> > On Jan 5, 2015, at 5:56 AM, Adam Gilmore wrote:
> >
> > The actual stack trace is:
> >
> > 2015-01-05 13:48:27,356 [2b55
alue copies as we filtered out
> the records that were not needed. This currently takes place in a separate
> filter operator and should be pushed down into the read operation to make
> use of the file meta-data and eliminate some of the reads.
>
> -Jason
>
>
>
> On Mon,
Hi guys,
I have a question re Parquet. I'm not sure if this is a Drill question or
Parquet, but thought I'd start here.
I have a sample dataset of ~100M rows in a Parquet file. It's quick to sum
a single column across the whole dataset.
I have a column which has approx 100 unique values (e.g.
et-format-2.1.1-drill-r1.jar:na]
at parquet.format.Util.read(Util.java:47)
~[parquet-format-2.1.1-drill-r1.jar:na]
... 21 common frames omitted
On Mon, Jan 5, 2015 at 6:26 PM, Adam Gilmore wrote:
> Hi all,
>
> I'm trying to do a really simple query on a parquet dire
Hi all,
I'm trying to do a really simple query on a parquet directory on HDFS.
This works fine:
select count(*) from hdfs.warehouse.saleparquet
However, this fails:
0: jdbc:drill:local> select sum(sellprice) from hdfs.warehouse.saleparquet;
Query failed: Query failed: Failure while running fra
Hi all,
I'm trying to work out how to filter by an objectId field using the Mongo
plugin. I've tried many combinations of = '{$oid: ''id''}' etc. but
nothing seems to work.
Is this implemented yet? If not, is there a JIRA item for it already?