Re: Query hangs on planning

2016-08-31 Thread Oscar Morante

Sure,
This is what I remember:

* Failure
   - embedded mode on my laptop
   - drill memory: 2GB/4GB (heap/direct)
   - cpu: 4 cores (+hyperthreading)
   - `planner.width.max_per_node=6`

* Success
   - AWS Cluster 2x c3.8xlarge
   - drill memory: 16GB/32GB
   - cpu: limited by kubernetes to 24 cores
   - `planner.width.max_per_node=23`

I'm too busy right now to test again, but I'll try to provide better 
info as soon as I can.



On Wed, Aug 31, 2016 at 05:38:53PM +0530, Khurram Faraaz wrote:

Can you please share the number of cores on the setup where the query hung
compared to the number of cores on the setup where the query went through
successfully, along with the memory details for the two scenarios?

Thanks,
Khurram

On Wed, Aug 31, 2016 at 4:50 PM, Oscar Morante <spacep...@gmail.com> wrote:


For the record, I think this was just bad memory configuration after all.
I retested on bigger machines and everything seems to be working fine.


On Tue, Aug 09, 2016 at 10:46:33PM +0530, Khurram Faraaz wrote:


Oscar, can you please report a JIRA with the required steps to reproduce
the OOM error. That way someone from the Drill team will take a look and
investigate.

For others interested here is the stack trace.

2016-08-09 16:51:14,280 [285642de-ab37-de6e-a54c-378aaa4ce50e:foreman]
ERROR o.a.drill.common.CatastrophicFailure - Catastrophic Failure Occurred,
exiting. Information message: Unable to handle out of memory condition in
Foreman.
java.lang.OutOfMemoryError: Java heap space
   at java.util.Arrays.copyOfRange(Arrays.java:2694) ~[na:1.7.0_111]
   at java.lang.String.<init>(String.java:203) ~[na:1.7.0_111]
   at java.lang.StringBuilder.toString(StringBuilder.java:405) ~[na:1.7.0_111]
   at org.apache.calcite.util.Util.newInternal(Util.java:785) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
   at org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:251) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
   at org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:808) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
   at org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:303) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
   at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:404) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
   at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:343) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
   at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:240) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
   at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:290) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
   at org.apache.drill.exec.planner.sql.handlers.ExplainHandler.getPlan(ExplainHandler.java:61) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
   at org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(DrillSqlWorker.java:94) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
   at org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:978) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
   at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:257) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_111]
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_111]
   at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]

Thanks,
Khurram

On Tue, Aug 9, 2016 at 7:46 PM, Oscar Morante <spacep...@gmail.com>
wrote:

Yeah, when I uncomment only the `upload_date` lines (a dir0 alias),
explain succeeds within ~30s.  Enabling any of the other lines triggers the
failure.

This is a log with the `upload_date` lines and `usage <> 'Test'` enabled:
https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e

The client times out around here (~1.5 hours):
https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e#file-drillbit-log-L178

And it still keeps running for a while until it dies (~2.5 hours):
https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e#file-drillbit-log-L178

The memory settings for this test were:

   DRILL_HEAP="4G"
   DRILL_MAX_DIRECT_MEMORY="8G"

This is on a laptop with 16G and I should probably lower it, but it seems
a bit excessive for such a small query.  And I think I got the same results
on a 2 node cluster with 8/16.  I'm gonna try again on the cluster to make
sure.

Thanks,
Oscar


On Tue, Aug 09, 2016 at 04:13:17PM +0530, Khurram Faraaz wrote:

You mentioned "*But if I uncomment the where clause then it runs for a
couple of hours until it runs out of memory.*"

Re: Query hangs on planning

2016-08-31 Thread Oscar Morante
For the record, I think this was just bad memory configuration after 
all.  I retested on bigger machines and everything seems to be working 
fine.


On Tue, Aug 09, 2016 at 10:46:33PM +0530, Khurram Faraaz wrote:

Oscar, can you please report a JIRA with the required steps to reproduce
the OOM error. That way someone from the Drill team will take a look and
investigate.

For others interested here is the stack trace.

2016-08-09 16:51:14,280 [285642de-ab37-de6e-a54c-378aaa4ce50e:foreman]
ERROR o.a.drill.common.CatastrophicFailure - Catastrophic Failure Occurred,
exiting. Information message: Unable to handle out of memory condition in
Foreman.
java.lang.OutOfMemoryError: Java heap space
   at java.util.Arrays.copyOfRange(Arrays.java:2694) ~[na:1.7.0_111]
   at java.lang.String.<init>(String.java:203) ~[na:1.7.0_111]
   at java.lang.StringBuilder.toString(StringBuilder.java:405)
~[na:1.7.0_111]
   at org.apache.calcite.util.Util.newInternal(Util.java:785)
~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
   at
org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:251)
~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
   at
org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:808)
~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
   at
org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:303)
~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
   at
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:404)
~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
   at
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:343)
~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
   at
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:240)
~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
   at
org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:290)
~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
   at
org.apache.drill.exec.planner.sql.handlers.ExplainHandler.getPlan(ExplainHandler.java:61)
~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
   at
org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(DrillSqlWorker.java:94)
~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
   at
org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:978)
~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
   at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:257)
~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
   at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[na:1.7.0_111]
   at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[na:1.7.0_111]
   at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]

Thanks,
Khurram

On Tue, Aug 9, 2016 at 7:46 PM, Oscar Morante <spacep...@gmail.com> wrote:


Yeah, when I uncomment only the `upload_date` lines (a dir0 alias),
explain succeeds within ~30s.  Enabling any of the other lines triggers the
failure.

This is a log with the `upload_date` lines and `usage <> 'Test'` enabled:
https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e

The client times out around here (~1.5 hours):
https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e#file-drillbit-log-L178

And it still keeps running for a while until it dies (~2.5 hours):
https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e#file-drillbit-log-L178

The memory settings for this test were:

   DRILL_HEAP="4G"
   DRILL_MAX_DIRECT_MEMORY="8G"

This is on a laptop with 16G and I should probably lower it, but it seems
a bit excessive for such a small query.  And I think I got the same results
on a 2 node cluster with 8/16.  I'm gonna try again on the cluster to make
sure.

Thanks,
Oscar


On Tue, Aug 09, 2016 at 04:13:17PM +0530, Khurram Faraaz wrote:


You mentioned "*But if I uncomment the where clause then it runs for a
couple of hours until it runs out of memory.*"

Can you please share the OutOfMemory details from drillbit.log and the
value of DRILL_MAX_DIRECT_MEMORY

Can you also try to see what happens if you retain just this line where
upload_date = '2016-08-01' in your where clause, can you check if the
explain succeeds.

Thanks,
Khurram

On Tue, Aug 9, 2016 at 4:00 PM, Oscar Morante <spacep...@gmail.com>
wrote:

Hi there,

I've been stuck with this for a while and I'm not sure if I'm running into
a bug or I'm just doing something very wrong.

I have this stripped-down version of my query:
https://gist.github.com/spacepluk/9ab1e1a0cfec6f0efb298f023f4c805b

The data is just a single file with one record (1.5K).

Without changing anything, explain takes ~1sec on my machine.  But if I
uncomment the where clause then it runs for a couple of hours until it
runs out of memory.

Re: Drill CPU Usage

2016-08-15 Thread Oscar Morante
jline.internal.TerminalLineSettings.exec(TerminalLineSettings.java:183)
at 
jline.internal.TerminalLineSettings.exec(TerminalLineSettings.java:173)
at 
jline.internal.TerminalLineSettings.stty(TerminalLineSettings.java:168)
at jline.internal.TerminalLineSettings.set(TerminalLineSettings.java:76)
at jline.UnixTerminal.enableInterruptCharacter(UnixTerminal.java:122)
at jline.console.ConsoleReader.readLine(ConsoleReader.java:2859)
at jline.console.ConsoleReader.readLine(ConsoleReader.java:2126)
at sqlline.SqlLine.begin(SqlLine.java:621)
at sqlline.SqlLine.start(SqlLine.java:375)
at sqlline.SqlLine.main(SqlLine.java:268)

  Locked ownable synchronizers:
- None

"VM Thread" os_prio=0 tid=0x7f4c4c09f000 nid=0x1dbd9 runnable

"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x7f4c4c025800 nid=0x1dbcc 
runnable

"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x7f4c4c027800 nid=0x1dbcd 
runnable

"GC task thread#2 (ParallelGC)" os_prio=0 tid=0x7f4c4c029000 nid=0x1dbce 
runnable

"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x7f4c4c02b000 nid=0x1dbcf 
runnable

"GC task thread#4 (ParallelGC)" os_prio=0 tid=0x7f4c4c02c800 nid=0x1dbd0 
runnable

"GC task thread#5 (ParallelGC)" os_prio=0 tid=0x7f4c4c02e800 nid=0x1dbd1 
runnable

"GC task thread#6 (ParallelGC)" os_prio=0 tid=0x7f4c4c030800 nid=0x1dbd2 
runnable

"GC task thread#7 (ParallelGC)" os_prio=0 tid=0x7f4c4c032000 nid=0x1dbd3 
runnable

"GC task thread#8 (ParallelGC)" os_prio=0 tid=0x7f4c4c034000 nid=0x1dbd4 
runnable

"GC task thread#9 (ParallelGC)" os_prio=0 tid=0x7f4c4c036000 nid=0x1dbd5 
runnable

"GC task thread#10 (ParallelGC)" os_prio=0 tid=0x7f4c4c037800 nid=0x1dbd6 
runnable

"GC task thread#11 (ParallelGC)" os_prio=0 tid=0x7f4c4c039800 nid=0x1dbd7 
runnable

"GC task thread#12 (ParallelGC)" os_prio=0 tid=0x7f4c4c03b000 nid=0x1dbd8 
runnable

"VM Periodic Task Thread" os_prio=0 tid=0x7f4c4c10f800 nid=0x1dbea waiting 
on condition

JNI global references: 314
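As a quick sanity check when reading dumps like the one above, the thread headers can be tallied by state. A rough sketch follows; jstack header formats vary across JVM versions, so the pattern below is illustrative rather than exhaustive:

```python
import re
from collections import Counter

def thread_states(dump: str) -> Counter:
    """Count jstack thread headers by their reported state keyword."""
    states = Counter()
    for line in dump.splitlines():
        # Thread headers carry 'nid=0x...' followed by a state description.
        m = re.search(r'nid=\S+ (runnable|waiting on condition|in Object\.wait)', line)
        if m:
            states[m.group(1)] += 1
    return states

# Two header lines shaped like the dump above:
sample = (
    '"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x1 nid=0x1dbcc runnable\n'
    '"VM Periodic Task Thread" os_prio=0 tid=0x2 nid=0x1dbea waiting on condition\n'
)
print(thread_states(sample))
```

A dump that is mostly GC and periodic threads in these states, as above, points at GC or housekeeping activity rather than a stuck query thread.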


-Original Message-
From: Avi Haleva [mailto:avi.hal...@aternity.com]
Sent: Thursday, August 11, 2016 14:10
To: user@drill.apache.org
Subject: RE: Drill CPU Usage

Both the log and out files look clean.

Attached you can find the thread dump output (3 snapshots).
I would appreciate it if someone could point me to the reason the drillbits
constantly consume CPU.

BTW, the CPU usage is now constant at 40% of a single core (double what it
was a week ago)


-Original Message-
From: Dechang Gu [mailto:d...@maprtech.com]
Sent: Tuesday, August 2, 2016 22:21
To: user@drill.apache.org
Subject: Re: Drill CPU Usage

On Tue, Aug 2, 2016 at 12:49 AM, Avi Haleva <avi.hal...@aternity.com> wrote:


Hi,
I'm in the process of evaluating Drill as an analytic repository.
I've noticed that even when idle, the drillbit process constantly consumes
20% of a single core.

Is that expected behavior or have I misconfigured something?



I don't think that is expected. I checked my system, and the drillbit CPU usage 
stays at 0% almost constantly when idle.
I did occasionally see it spike to 20-30% CPU usage, due to zookeeper 
heartbeat traffic, I guess.

Can you collect jstack on the drillbit process to find out what is running?

Also check drillbit.out and drillbit.log (typically under /var/log/drill/ ) to 
see if there is anything suspicious.

HTH, and Thanks,
Dechang




I'm running drill on a single node with zookeeper in non-embedded mode
(as I want to have more than a single connection concurrently)

Thanks,
Avi




--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
 -- Isaac Asimov.


signature.asc
Description: Digital signature


Re: Dockerized Drill with Bridged Network

2016-08-09 Thread Oscar Morante

You can try this image if you want to do a quick test:

   https://hub.docker.com/r/miradatv/apache-drill/

It has a few patches though.


On Tue, Aug 09, 2016 at 05:36:56PM +0300, Oscar Morante wrote:

Hi Suhas,
I've been running a Drill cluster on Kubernetes just by patching this:

  
https://github.com/miradatv/drill/commit/ef139a4d8645936a0ecbc8318546fad99ee9609b

I think it should also work on Mesos.
Cheers,


On Mon, Aug 08, 2016 at 12:55:32PM -0700, Suhas Gaddam wrote:

Hi,

I am trying to deploy a dockerized Drill in a mesos cluster with bridge
networking and in distributed mode. I can successfully run a single
instance of drill but when I try to scale to 2 instances Drill breaks with
java.io.IOException: Failed to create DrillClient: CONNECTION :
java.nio.channels.UnresolvedAddressException

From my first pass at debugging, I think this is related to the host name
that Drill publishes to zookeeper. When run in a bridged docker container,
Drill seems to put the hostname from inside the docker container in the
zookeeper node which has no meaning outside of that instance. Is there a
way to specify the hostname Drill publishes?

https://github.com/vadio/docker-apache-drill

Thanks,

Suhas




Re: Dockerized Drill with Bridged Network

2016-08-09 Thread Oscar Morante

Hi Suhas,
I've been running a Drill cluster on Kubernetes just by patching this:

   
https://github.com/miradatv/drill/commit/ef139a4d8645936a0ecbc8318546fad99ee9609b

I think it should also work on Mesos.
Cheers,


On Mon, Aug 08, 2016 at 12:55:32PM -0700, Suhas Gaddam wrote:

Hi,

I am trying to deploy a dockerized Drill in a mesos cluster with bridge
networking and in distributed mode. I can successfully run a single
instance of drill but when I try to scale to 2 instances Drill breaks with
java.io.IOException: Failed to create DrillClient: CONNECTION :
java.nio.channels.UnresolvedAddressException

From my first pass at debugging, I think this is related to the host name
that Drill publishes to zookeeper. When run in a bridged docker container,
Drill seems to put the hostname from inside the docker container in the
zookeeper node which has no meaning outside of that instance. Is there a
way to specify the hostname Drill publishes?

https://github.com/vadio/docker-apache-drill

Thanks,

Suhas




Query hangs on planning

2016-08-09 Thread Oscar Morante

Hi there,
I've been stuck with this for a while and I'm not sure if I'm running 
into a bug or I'm just doing something very wrong.


I have this stripped-down version of my query: 
https://gist.github.com/spacepluk/9ab1e1a0cfec6f0efb298f023f4c805b


The data is just a single file with one record (1.5K).

Without changing anything, explain takes ~1sec on my machine.  But if I 
uncomment the where clause then it runs for a couple of hours until it 
runs out of memory.


Also if I uncomment the where clause *and* take out the join, then it 
takes around 30s to plan.


Any ideas?
Thanks!





Re: Partition prunning using CURRENT_DATE?

2016-07-25 Thread Oscar Morante
Great, thanks!  I'm gonna try swapping the view file and see how it 
goes.


On Thu, Jul 21, 2016 at 11:09:45AM -0500, John Omernik wrote:

Yes, I have a view that has the hard coded date in it. It wasn't difficult,
and using the REST API was actually fairly neat/clean.  I agree with you,
it would be nice, but this worked pretty well for me too.  (I also wonder
if you could just change the raw view def file and update the date)


John
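For reference, the view recreation John describes could be scripted over Drill's REST API roughly as below. The view name and storage paths are placeholders, and while POST /query.json is Drill's documented REST query endpoint, this sketch only builds the payload and leaves the HTTP call to whatever client the scheduler uses:

```python
import datetime
import json

def build_view_ddl(day: datetime.date) -> str:
    """DDL that pins 'today' as a literal so partition pruning can fire."""
    return (
        "CREATE OR REPLACE VIEW dfs.tmp.`today_view` AS "  # placeholder view name
        "SELECT * FROM dfs.`json/by-date` "
        f"WHERE dir0 = '{day.isoformat()}'"
    )

ddl = build_view_ddl(datetime.date(2016, 7, 21))
payload = json.dumps({"queryType": "SQL", "query": ddl})
print(payload)
# A daily cron/scheduler would POST this payload to
# http://<drillbit>:8047/query.json with Content-Type: application/json.
```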


On Thu, Jul 21, 2016 at 2:26 AM, Oscar Morante <spacep...@gmail.com> wrote:


Hi John,
I've been following your trail of emails :)  Thanks for sharing all that
info, it's very useful.

I think I'm trying to do something very similar to what you did.  I have
data flowing from Storm into S3 and I wanted to be able to periodically
preprocess/repartition it into a new folder and then have views to merge recent
data from the raw Storm files and old data from the
preprocessed/repartitioned folders.  These views are intended to be used
from Tableau.

I guess I can create a small process that checks when a new folder with
preprocessed data is available and replaces the appropriate view files with
new versions that have the proper date string.  But it would be a lot nicer
to just do it in the view and have a dumb process executing the periodic
queries.

How did you "solve" it in the end?  If I can ask.

Thanks,
Oscar


On Wed, Jul 20, 2016 at 07:26:20AM -0500, John Omernik wrote:


I think I ran into that issue before and (someone will correct me if I am
wrong) the issue is that current_date is only materialized AFTER planning.
Thus the pruning, which occurs during planning, doesn't happen.  Is this a
programmatic query or just something that is being done for users? I know my
issue was I wanted a view that showed only the current date, and I
struggled to come up with a good solution to that.

John


On Wed, Jul 20, 2016 at 6:06 AM, Oscar Morante <spacep...@gmail.com>
wrote:

I'm trying to trigger partition pruning like this:


   select *
   from dfs.`json/by-date`
   where dir0 = cast(current_date as varchar);

But apparently, it only works when passing a literal.  Am I missing
something?

Thanks,







--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
 -- Isaac Asimov.




Re: Partition prunning using CURRENT_DATE?

2016-07-21 Thread Oscar Morante

Hi John,
I've been following your trail of emails :)  Thanks for sharing all that 
info, it's very useful.


I think I'm trying to do something very similar to what you did.  I have 
data flowing from Storm into S3 and I wanted to be able to periodically 
preprocess/repartition into new folder and then have views to merge 
recent data from the raw Storm files and old data from the 
preprocessed/repartitioned folders.  These views are intended to be used 
from Tableau.


I guess I can create a small process that checks when a new folder with 
preprocessed data is available and replaces the appropriate view files 
with new versions that have the proper date string.  But it would be a 
lot nicer to just do it in the view and have a dumb process executing 
the periodic queries.


How did you "solve" it in the end?  If I can ask.

Thanks,
Oscar

On Wed, Jul 20, 2016 at 07:26:20AM -0500, John Omernik wrote:

I think I ran into that issue before and (someone will correct me if I am
wrong) the issue is that current_date is only materialized AFTER planning.
Thus the pruning, which occurs during planning, doesn't happen.  Is this a
programmatic query or just something that is being done for users? I know my
issue was I wanted a view that showed only the current date, and I
struggled to come up with a good solution to that.

John


On Wed, Jul 20, 2016 at 6:06 AM, Oscar Morante <spacep...@gmail.com> wrote:


I'm trying to trigger partition pruning like this:

   select *
   from dfs.`json/by-date`
   where dir0 = cast(current_date as varchar);

But apparently, it only works when passing a literal.  Am I missing
something?

Thanks,






Partition prunning using CURRENT_DATE?

2016-07-20 Thread Oscar Morante

I'm trying to trigger partition pruning like this:

   select *
   from dfs.`json/by-date`
   where dir0 = cast(current_date as varchar);

But apparently, it only works when passing a literal.  Am I missing 
something?


Thanks,
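One workaround consistent with the observation above (pruning fires only for literals) is to inline the date on the client before submitting the query. A minimal sketch, with the table path taken from the query above and everything else illustrative:

```python
import datetime

def pruned_query(day: datetime.date) -> str:
    """Inline the date as a literal so the planner can prune dir0 partitions."""
    return (
        "select * from dfs.`json/by-date` "
        f"where dir0 = '{day.isoformat()}'"
    )

print(pruned_query(datetime.date(2016, 7, 20)))
# select * from dfs.`json/by-date` where dir0 = '2016-07-20'
```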

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
 -- Isaac Asimov.




Re: S3 query takes a long time to start

2016-03-13 Thread Oscar Morante

Hi Jacques,
All these technologies are pretty new to me, but I can give it a shot :)

Is there any literature that can help me understand how things are set 
up?


Cheers,


On Fri, Mar 11, 2016 at 01:42:54PM -0800, Jacques Nadeau wrote:

I've been thinking a lot about this. I definitely think there should be a clean
fix to this but haven't had the cycles to suggest something. You up for
looking at the code and trying to suggest something?

thanks!

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Mar 10, 2016 at 8:06 AM, Oscar Morante <spacep...@gmail.com> wrote:


I've been checking the logs, and I think that the problem is that it's
walking through the "directories" in S3 recursively, doing lots of small
HTTP requests.

My files are organized like this which amplifies the issue:

   /category/random-hash/year/month/day/hour/data-chunk-000.json.gz

The random hash is there to trick S3 into using a different
partition/shard for each put [1].  But it looks like this structure is
clashing with the way Drill/hadoop.fs.s3a get the list of files.

I think that it should be possible to get the complete list of files under
a given "directory" (e.g. `/category`) doing just one HTTP query, but I
don't know how hard it would be to incorporate that behavior.

Any ideas?  How are you organizing your S3 files to get good performance?

Thanks!

[1]:
http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
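The single-request listing suggested above is possible because S3's keyspace is flat: `ListObjectsV2` returns every key under a prefix in pages of up to 1,000, with no per-"directory" recursion. A sketch using boto3 follows; the bucket and prefix are placeholders, and the network call itself is not exercised here:

```python
def keys_from_pages(pages):
    """Flatten ListObjectsV2 result pages into a list of object keys."""
    return [obj["Key"] for page in pages for obj in page.get("Contents", [])]

def list_all_keys(bucket, prefix):
    """Fetch every key under a prefix: one LIST request per ~1000 keys,
    regardless of how many '/'-delimited levels the keys contain."""
    import boto3  # assumed available; this function is not exercised below
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    return keys_from_pages(paginator.paginate(Bucket=bucket, Prefix=prefix))

# Offline demonstration with fake result pages shaped like the S3 API's:
pages = [
    {"Contents": [{"Key": "category/ab12/2016/03/10/00/data-chunk-000.json.gz"}]},
    {"Contents": [{"Key": "category/cd34/2016/03/10/01/data-chunk-000.json.gz"}]},
]
print(keys_from_pages(pages))
```

For the layout above (~one file per hour per hash), this turns thousands of small recursive LIST/HEAD round trips into a handful of paginated requests.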



On Thu, Mar 10, 2016 at 12:27:42PM +0200, Oscar Morante wrote:


I'm querying 20G of gzipped JSONs split in ~5600 small files with sizes
ranging from 1M to 30Mb.  Drill is running in aws in 4 m4.xlarge nodes and
it's taking around 50 minutes before the query starts executing.

Any idea what could be causing this delay?  What's the best way to debug
this?

Thanks,



--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
 -- Isaac Asimov.




Re: S3 query takes a long time to start

2016-03-10 Thread Oscar Morante
I've been checking the logs, and I think that the problem is that it's 
walking through the "directories" in S3 recursively, doing lots of small 
HTTP requests.


My files are organized like this which amplifies the issue:

   /category/random-hash/year/month/day/hour/data-chunk-000.json.gz

The random hash is there to trick S3 into using a different 
partition/shard for each put [1].  But it looks like this structure is 
clashing with the way Drill/hadoop.fs.s3a get the list of files.


I think that it should be possible to get the complete list of files 
under a given "directory" (e.g. `/category`) doing just one HTTP query, 
but I don't know how hard it would be to incorporate that behavior.


Any ideas?  How are you organizing your S3 files to get good 
performance?


Thanks!

[1]: 
http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html


On Thu, Mar 10, 2016 at 12:27:42PM +0200, Oscar Morante wrote:
I'm querying 20G of gzipped JSONs split in ~5600 small files with 
sizes ranging from 1M to 30Mb.  Drill is running in aws in 4 m4.xlarge 
nodes and it's taking around 50 minutes before the query starts 
executing.


Any idea what could be causing this delay?  What's the best way to 
debug this?


Thanks,


--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
 -- Isaac Asimov.




S3 query takes a long time to start

2016-03-10 Thread Oscar Morante
I'm querying 20G of gzipped JSONs split in ~5600 small files with sizes 
ranging from 1MB to 30MB.  Drill is running in AWS on 4 m4.xlarge nodes 
and it's taking around 50 minutes before the query starts executing.


Any idea what could be causing this delay?  What's the best way to debug 
this?


Thanks,

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
 -- Isaac Asimov.




Time intervals overflowing?

2016-03-10 Thread Oscar Morante
I was trying to generate a sequence of timestamps from a sequence of 
numbers, and run into this:


   select
 596 * interval '1' hour,
 597 * interval '1' hour
   from (values 1);

    EXPR$0     |EXPR$1          |
    -----------|----------------|
    PT2145600S |PT-2145767.296S |

Notice that the second expression turns into a negative interval.  Is 
that the expected behavior?
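The output is consistent with the interval's millisecond count being held in a signed 32-bit integer: 596 hours is 2,145,600,000 ms, which fits under 2^31 - 1 = 2,147,483,647, while 597 hours is 2,149,200,000 ms and wraps negative. A sketch of the arithmetic (the 32-bit storage is inferred from the output, not confirmed from Drill's source):

```python
def wrap_int32(ms: int) -> int:
    """Wrap a millisecond count the way a signed 32-bit integer would."""
    return ((ms + 2**31) % 2**32) - 2**31

ok = 596 * 3600 * 1000   # 2,145,600,000 ms, fits in int32
bad = 597 * 3600 * 1000  # 2,149,200,000 ms, exceeds 2**31 - 1

print(wrap_int32(ok) / 1000)   # 2145600.0, matching PT2145600S
print(wrap_int32(bad) / 1000)  # -2145767.296, matching PT-2145767.296S
```

The wrapped value reproduces the observed PT-2145767.296S exactly, which strongly suggests an int32 millisecond overflow rather than random corruption.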


Cheers,

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
 -- Isaac Asimov.




Re: Need some help debugging slow S3 download speed

2016-02-29 Thread Oscar Morante


You were right and the time is actually spent on decompressing the 
files, I must have messed up my initial profiling on that.  And I'm 
already using the native libraries so I guess that's as good as it gets.


After checking that an uncompressed file downloads at full speed, I did 
some more tests and it turns out that the query on the gzipped file is 
faster even though the download rate is lower!  So at least I learned a 
good lesson about making wrong assumptions, hehe.


Anyways, I'm still wondering if there are any other places I can squeeze 
some extra performance from, and I can only think of:


- Maybe using snappy to decompress?  I tried to find where the 
 decompression takes place but I couldn't find it.
- Use a different compression algorithm, any recommendations based on 
 experience?


Thanks!


On Sun, Feb 21, 2016 at 04:58:17PM -0800, Jacques Nadeau wrote:

The zipped question is a good one. I believe you need to add extra native
libraries to get reasonable performance when using gzip files.

See if you are seeing this in your logs or out:

Unable to load native-hadoop library for your platform

If so, probably need to get them setup per here:

https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/NativeLibraries.html


On Sun, Feb 21, 2016 at 11:53 AM, Oscar Morante <spacep...@gmail.com> wrote:


I'm still fighting with this, so far I've tried:

 - Adding the native hadoop libraries.
 - `s3n://` and `s3a://`.
 - SSL on/off.
 - different S3 endpoints.
 - different values in `drill.exec.buffer.size`.

None of these seem to make a difference and `s3cmd` is always ten times
faster than Drill to download the same file.  Netstat shows that about 800k
piling up in Recv-Q during the query, and `s3cmd` is pretty much clean the
whole time.

I've also noticed that if I cancel the query in the middle, the download
speed suddenly goes up and matches s3cmd for a while before it stops.

Is there anything else that I can try to improve the situation?  At the
beginning I thought that S3 was the bottleneck but everything is pointing to
some kind of lock in Drill.

Or maybe I'm just being unrealistic and asking too much :?
Cheers,



On Fri, Feb 19, 2016 at 02:27:56PM +0200, Oscar Morante wrote:


Hi there,

I'm experiencing very slow download rates from S3 but only when using
Drill.  This is testing with only one drillbit and querying a 250Mb gzipped
JSON:

  select count(somefield) from s3.`test/big.json.gz`;

The download speed while drill is executing the query is about 5Mb/s.
Then if I try downloading the same file from the same environment using
`s3cmd` the average speed is about 60Mb/s.

Any idea what could be causing such a big difference?  I'm not sure
what's the best way to debug this, or what are the relevant configuration
parameters that I should be tweaking.

Thanks!


--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
 -- Isaac Asimov.




Re: Need some help debugging slow S3 download speed

2016-02-21 Thread Oscar Morante

I'm still fighting with this, so far I've tried:

 - Adding the native hadoop libraries.
 - `s3n://` and `s3a://`.
 - SSL on/off.
 - different S3 endpoints.
 - different values in `drill.exec.buffer.size`.

None of these seem to make a difference and `s3cmd` is always ten times 
faster than Drill to download the same file.  Netstat shows that about 
800k piling up in Recv-Q during the query, and `s3cmd` is pretty much 
clean the whole time.


I've also noticed that if I cancel the query in the middle, the download 
speed suddenly goes up and matches s3cmd for a while before it stops.


Is there anything else that I can try to improve the situation?  At the 
beginning I thought that S3 was the bottleneck but everything is pointing 
to some kind of lock in Drill.


Or maybe I'm just being unrealistic and asking too much :?
Cheers,


On Fri, Feb 19, 2016 at 02:27:56PM +0200, Oscar Morante wrote:

Hi there,

I'm experiencing very slow download rates from S3 but only when using 
Drill.  This is testing with only one drillbit and querying a 250Mb 
gzipped JSON:


  select count(somefield) from s3.`test/big.json.gz`;

The download speed while drill is executing the query is about 5Mb/s.  
Then if I try downloading the same file from the same environment 
using `s3cmd` the average speed is about 60Mb/s.


Any idea what could be causing such a big difference?  I'm not sure 
what's the best way to debug this, or what are the relevant 
configuration parameters that I should be tweaking.


Thanks!




--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
 -- Isaac Asimov.




Need some help debugging slow S3 download speed

2016-02-19 Thread Oscar Morante

Hi there,

I'm experiencing very slow download rates from S3 but only when using 
Drill.  This is testing with only one drillbit and querying a 250MB 
gzipped JSON:


   select count(somefield) from s3.`test/big.json.gz`;

The download speed while drill is executing the query is about 5Mb/s.  
Then if I try downloading the same file from the same environment using 
`s3cmd` the average speed is about 60Mb/s.


Any idea what could be causing such a big difference?  I'm not sure 
what's the best way to debug this, or what are the relevant 
configuration parameters that I should be tweaking.


Thanks!

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
 -- Isaac Asimov.




Re: 'DATE - DATE' not supported?

2016-02-10 Thread Oscar Morante
Thanks for the explanation, I'll just modify the queries in the meantime 
:)


On Tue, Feb 09, 2016 at 11:52:12AM -0800, Hsuan Yi Chu wrote:

The reason you are seeing this is that the types are known to Drill at
planning time.

The planning engine Drill uses is Calcite which does not allow this
operation to happen. (as the error message suggests).

One workaround is to hide the type information with
cast(... as ANY), for example:

SELECT cast(DATE '2009-01-01' as any) - cast(DATE '2009-05-05' as any) FROM
(values(1));


On the other hand, the community is doing some work to relax the type
constraints (part of DRILL-4372). Hopefully you will be able to run your
original query in the near future.
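[Editor's note: for reference, the value the blocked query is after — the signed day count between two DATE literals — can be sketched outside Drill with Python's standard library. The function name `datediff_days` is a hypothetical stand-in used only to illustrate the expected semantics while the cast(... as ANY) workaround is in place.]

```python
from datetime import date

def datediff_days(d1: date, d2: date) -> int:
    """Signed number of days from d2 to d1, i.e. the value
    DATE d1 - DATE d2 would yield if Drill allowed it."""
    return (d1 - d2).days

# The literals from the failing query: 2009-01-01 minus 2009-05-05.
print(datediff_days(date(2009, 1, 1), date(2009, 5, 5)))  # -124
```

The result is negative because the first date precedes the second, matching the usual SQL interval sign convention.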



On Tue, Feb 9, 2016 at 6:47 AM, Oscar Morante <spacep...@gmail.com> wrote:



Hi there,

I'm new to both the list and Drill.  I'm doing some tests using Tableau
and when I use the `DATEDIFF` function in a calculated field Tableau
generates queries that contain stuff like this:

   SELECT DATE '2009-01-01' - DATE '2009-05-05' FROM (values(1));

...which throws this error:

   ExampleExceptionFormatter: exception message was: VALIDATION ERROR:
From line 1, column 8 to line 1, column 44: Cannot apply '-' to
arguments of type '<DATE> - <DATE>'. Supported form(s):
   '<NUMERIC> - <NUMERIC>'
   '<DATETIME_INTERVAL> - <DATETIME_INTERVAL>'
   '<DATETIME> - <DATETIME_INTERVAL>'

I was looking around and according to the original task (DRILL-549) that
kind of query should be supported.  Is this a regression?  Am I doing
something wrong?

Thanks,


--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
 -- Isaac Asimov.




'DATE - DATE' not supported?

2016-02-09 Thread Oscar Morante


Hi there,

I'm new to both the list and Drill.  I'm doing some tests using Tableau 
and when I use the `DATEDIFF` function in a calculated field Tableau 
generates queries that contain stuff like this:


   SELECT DATE '2009-01-01' - DATE '2009-05-05' FROM (values(1));

...which throws this error:

   ExampleExceptionFormatter: exception message was: VALIDATION ERROR: 
   From line 1, column 8 to line 1, column 44: Cannot apply '-' to 
   arguments of type '<DATE> - <DATE>'. Supported form(s):

   '<NUMERIC> - <NUMERIC>'
   '<DATETIME_INTERVAL> - <DATETIME_INTERVAL>'
   '<DATETIME> - <DATETIME_INTERVAL>'

I was looking around and according to the original task (DRILL-549) that 
kind of query should be supported.  Is this a regression?  Am I doing 
something wrong?


Thanks,

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
 -- Isaac Asimov.




Source for drill's calcite?

2016-02-09 Thread Oscar Morante
I'm trying to add support for week and weekdays to `date_trunc` and 
`date_part`.  It seems to be working fine right now except that I need 
to patch calcite's TimeUnit so that the parser doesn't complain when 
using `extract` directly.


I first tried the latest Calcite, but it's too different from the 
version Drill is using, and `1.4.0-incubation` doesn't work either.


I've tried to look for the source but I can't seem to find 
`1.4.0-drill-r10` anywhere.


Any ideas?


PS: Here's the patch so far -> 
https://gist.github.com/spacepluk/40df5a90ddee2efe1f4a
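[Editor's note: the week semantics the patch is aiming for can be sketched with Python's standard library, assuming ISO conventions (weeks start on Monday, weekday numbered 1–7). The function names below are illustrative stand-ins for what `date_trunc('week', ...)` and `date_part('weekday', ...)` would return; the gist above is the authoritative source for the actual patch.]

```python
from datetime import date, timedelta

def trunc_week(d: date) -> date:
    """Truncate to the Monday that starts d's ISO week,
    i.e. the intended result of date_trunc('week', d)."""
    return d - timedelta(days=d.weekday())  # weekday(): Monday == 0

def weekday_part(d: date) -> int:
    """ISO day-of-week, Monday=1 .. Sunday=7,
    i.e. the intended result of date_part('weekday', d)."""
    return d.isoweekday()

print(trunc_week(date(2016, 2, 9)))    # 2016-02-08 (Tuesday -> Monday)
print(weekday_part(date(2016, 2, 9)))  # 2
```

A Monday input is already truncated, so `trunc_week` is idempotent, which is the property the parser-level `extract`/TimeUnit change has to preserve.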

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
 -- Isaac Asimov.




Re: Source for drill's calcite?

2016-02-09 Thread Oscar Morante

Thanks! :)

On Tue, Feb 09, 2016 at 08:30:44AM -0800, Abdel Hakim Deneche wrote:

You can find the r10 branch here:

https://github.com/mapr/incubator-calcite/tree/DrillCalcite1.4.0

On Tue, Feb 9, 2016 at 8:00 AM, Jason Altekruse <altekruseja...@gmail.com>
wrote:


I can't find the latest version either, but this is the r9 branch. I don't
think any major changes happened in the last update (it's likely just a
back-ported fix from Calcite master), so you can base your work on this
branch and rebase once someone points you to the updated branch.

https://github.com/dremio/calcite/commits/1.4.0-drill-r9

On Tue, Feb 9, 2016 at 7:46 AM, Oscar Morante <spacep...@gmail.com> wrote:

> I'm trying to add support for week and weekdays to `date_trunc` and
> `date_part`.  It seems to be working fine right now except that I need to
> patch calcite's TimeUnit so that the parser doesn't complain when using
> `extract` directly.
>
> I first tried the latest Calcite, but it's too different from the
> version Drill is using, and `1.4.0-incubation` doesn't work either.
>
> I've tried to look for the source but I can't seem to find
> `1.4.0-drill-r10` anywhere.
>
> Any ideas?
>
>
> PS: Here's the patch so far ->
> https://gist.github.com/spacepluk/40df5a90ddee2efe1f4a


--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is."
 -- Isaac Asimov.

