Re: Query hangs on planning
Sure, this is what I remember:

* Failure - embedded mode on my laptop
  - drill memory: 2Gb/4Gb (heap/direct)
  - cpu: 4 cores (+hyperthreading)
  - `planner.width.max_per_node=6`
* Success - AWS cluster, 2x c3.8xlarge
  - drill memory: 16Gb/32Gb
  - cpu: limited by kubernetes to 24 cores
  - `planner.width.max_per_node=23`

I'm too busy to test again right now, but I'll try to provide better info as soon as I can.

On Wed, Aug 31, 2016 at 05:38:53PM +0530, Khurram Faraaz wrote:

Can you please share the number of cores on the setup where the query hung, compared to the number of cores on the setup where the query went through successfully? And the memory details from the two scenarios. Thanks, Khurram

On Wed, Aug 31, 2016 at 4:50 PM, Oscar Morante <spacep...@gmail.com> wrote:

For the record, I think this was just bad memory configuration after all. I retested on bigger machines and everything seems to be working fine.

On Tue, Aug 09, 2016 at 10:46:33PM +0530, Khurram Faraaz wrote:

Oscar, can you please report a JIRA with the required steps to reproduce the OOM error? That way someone from the Drill team will take a look and investigate. For others interested, here is the stack trace.

2016-08-09 16:51:14,280 [285642de-ab37-de6e-a54c-378aaa4ce50e:foreman] ERROR o.a.drill.common.CatastrophicFailure - Catastrophic Failure Occurred, exiting. Information message: Unable to handle out of memory condition in Foreman.
java.lang.OutOfMemoryError: Java heap space
  at java.util.Arrays.copyOfRange(Arrays.java:2694) ~[na:1.7.0_111]
  at java.lang.String.<init>(String.java:203) ~[na:1.7.0_111]
  at java.lang.StringBuilder.toString(StringBuilder.java:405) ~[na:1.7.0_111]
  at org.apache.calcite.util.Util.newInternal(Util.java:785) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
  at org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:251) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
  at org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:808) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
  at org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:303) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
  at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:404) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
  at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:343) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
  at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:240) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
  at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:290) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
  at org.apache.drill.exec.planner.sql.handlers.ExplainHandler.getPlan(ExplainHandler.java:61) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
  at org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(DrillSqlWorker.java:94) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
  at org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:978) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
  at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:257) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_111]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_111]
  at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]

Thanks, Khurram

On Tue, Aug 9, 2016 at 7:46 PM, Oscar Morante <spacep...@gmail.com> wrote:

Yeah, when I uncomment only the `upload_date` lines (a dir0 alias), explain succeeds within ~30s. Enabling any of the other lines triggers the failure.

This is a log with the `upload_date` lines and `usage <> 'Test'` enabled: https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e

The client times out around here (~1.5 hours): https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e#file-drillbit-log-L178

And it still keeps running for a while until it dies (~2.5 hours): https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e#file-drillbit-log-L178

The memory settings for this test were: DRILL_HEAP="4G" DRILL_MAX_DIRECT_MEMORY="8G"

This is on a laptop with 16G and I should probably lower it, but it seems a bit excessive for such a small query. And I think I got the same results on a 2-node cluster with 8/16. I'm gonna try again on the cluster to make sure.

On Tue, Aug 09, 2016 at 04:13:17PM +0530, Khurram Faraaz wrote: You mentioned "*But if I uncomment
Re: Query hangs on planning
For the record, I think this was just bad memory configuration after all. I retested on bigger machines and everything seems to be working fine.

On Tue, Aug 09, 2016 at 10:46:33PM +0530, Khurram Faraaz wrote:

Oscar, can you please report a JIRA with the required steps to reproduce the OOM error? That way someone from the Drill team will take a look and investigate. For others interested, here is the stack trace.

2016-08-09 16:51:14,280 [285642de-ab37-de6e-a54c-378aaa4ce50e:foreman] ERROR o.a.drill.common.CatastrophicFailure - Catastrophic Failure Occurred, exiting. Information message: Unable to handle out of memory condition in Foreman.

java.lang.OutOfMemoryError: Java heap space
  at java.util.Arrays.copyOfRange(Arrays.java:2694) ~[na:1.7.0_111]
  at java.lang.String.<init>(String.java:203) ~[na:1.7.0_111]
  at java.lang.StringBuilder.toString(StringBuilder.java:405) ~[na:1.7.0_111]
  at org.apache.calcite.util.Util.newInternal(Util.java:785) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
  at org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:251) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
  at org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:808) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
  at org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:303) ~[calcite-core-1.4.0-drill-r16-PATCHED.jar:1.4.0-drill-r16-PATCHED]
  at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:404) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
  at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.transform(DefaultSqlHandler.java:343) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
  at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:240) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
  at org.apache.drill.exec.planner.sql.handlers.DefaultSqlHandler.convertToDrel(DefaultSqlHandler.java:290) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
  at org.apache.drill.exec.planner.sql.handlers.ExplainHandler.getPlan(ExplainHandler.java:61) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
  at org.apache.drill.exec.planner.sql.DrillSqlWorker.getPlan(DrillSqlWorker.java:94) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
  at org.apache.drill.exec.work.foreman.Foreman.runSQL(Foreman.java:978) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
  at org.apache.drill.exec.work.foreman.Foreman.run(Foreman.java:257) ~[drill-java-exec-1.8.0-SNAPSHOT.jar:1.8.0-SNAPSHOT]
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_111]
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_111]
  at java.lang.Thread.run(Thread.java:745) [na:1.7.0_111]

Thanks, Khurram

On Tue, Aug 9, 2016 at 7:46 PM, Oscar Morante <spacep...@gmail.com> wrote:

Yeah, when I uncomment only the `upload_date` lines (a dir0 alias), explain succeeds within ~30s. Enabling any of the other lines triggers the failure.

This is a log with the `upload_date` lines and `usage <> 'Test'` enabled: https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e

The client times out around here (~1.5 hours): https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e#file-drillbit-log-L178

And it still keeps running for a while until it dies (~2.5 hours): https://gist.github.com/spacepluk/d7ac11c0de6859e4bd003d2022b3c55e#file-drillbit-log-L178

The memory settings for this test were: DRILL_HEAP="4G" DRILL_MAX_DIRECT_MEMORY="8G"

This is on a laptop with 16G and I should probably lower it, but it seems a bit excessive for such a small query. And I think I got the same results on a 2-node cluster with 8/16. I'm gonna try again on the cluster to make sure.
Thanks, Oscar

On Tue, Aug 09, 2016 at 04:13:17PM +0530, Khurram Faraaz wrote:

You mentioned "*But if I uncomment the where clause then it runs for a couple of hours until it runs out of memory.*" Can you please share the OutOfMemory details from drillbit.log and the value of DRILL_MAX_DIRECT_MEMORY? Can you also try retaining just this line, where upload_date = '2016-08-01', in your where clause, and check whether the explain succeeds? Thanks, Khurram

On Tue, Aug 9, 2016 at 4:00 PM, Oscar Morante <spacep...@gmail.com> wrote:

Hi there, I've been stuck with this for a while, and I'm not sure if I'm running into a bug or just doing something very wrong. I have this stripped-down version of my query: https://gist.github.com/spacepluk/9ab1e1a0cfec6f0efb298f023f4c805b

The data is just a single file with one record (1.5K). Without changing anything, explain takes ~1sec on my machine.
Re: Drill CPU Usage
  at jline.internal.TerminalLineSettings.exec(TerminalLineSettings.java:183)
  at jline.internal.TerminalLineSettings.exec(TerminalLineSettings.java:173)
  at jline.internal.TerminalLineSettings.stty(TerminalLineSettings.java:168)
  at jline.internal.TerminalLineSettings.set(TerminalLineSettings.java:76)
  at jline.UnixTerminal.enableInterruptCharacter(UnixTerminal.java:122)
  at jline.console.ConsoleReader.readLine(ConsoleReader.java:2859)
  at jline.console.ConsoleReader.readLine(ConsoleReader.java:2126)
  at sqlline.SqlLine.begin(SqlLine.java:621)
  at sqlline.SqlLine.start(SqlLine.java:375)
  at sqlline.SqlLine.main(SqlLine.java:268)

  Locked ownable synchronizers:
  - None

"VM Thread" os_prio=0 tid=0x7f4c4c09f000 nid=0x1dbd9 runnable
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x7f4c4c025800 nid=0x1dbcc runnable
"GC task thread#1 (ParallelGC)" os_prio=0 tid=0x7f4c4c027800 nid=0x1dbcd runnable
"GC task thread#2 (ParallelGC)" os_prio=0 tid=0x7f4c4c029000 nid=0x1dbce runnable
"GC task thread#3 (ParallelGC)" os_prio=0 tid=0x7f4c4c02b000 nid=0x1dbcf runnable
"GC task thread#4 (ParallelGC)" os_prio=0 tid=0x7f4c4c02c800 nid=0x1dbd0 runnable
"GC task thread#5 (ParallelGC)" os_prio=0 tid=0x7f4c4c02e800 nid=0x1dbd1 runnable
"GC task thread#6 (ParallelGC)" os_prio=0 tid=0x7f4c4c030800 nid=0x1dbd2 runnable
"GC task thread#7 (ParallelGC)" os_prio=0 tid=0x7f4c4c032000 nid=0x1dbd3 runnable
"GC task thread#8 (ParallelGC)" os_prio=0 tid=0x7f4c4c034000 nid=0x1dbd4 runnable
"GC task thread#9 (ParallelGC)" os_prio=0 tid=0x7f4c4c036000 nid=0x1dbd5 runnable
"GC task thread#10 (ParallelGC)" os_prio=0 tid=0x7f4c4c037800 nid=0x1dbd6 runnable
"GC task thread#11 (ParallelGC)" os_prio=0 tid=0x7f4c4c039800 nid=0x1dbd7 runnable
"GC task thread#12 (ParallelGC)" os_prio=0 tid=0x7f4c4c03b000 nid=0x1dbd8 runnable
"VM Periodic Task Thread" os_prio=0 tid=0x7f4c4c10f800 nid=0x1dbea waiting on condition

JNI global references: 314

-----Original Message-----
From: Avi Haleva [mailto:avi.hal...@aternity.com]
Sent: Thursday, August 11, 2016 14:10
To: user@drill.apache.org
Subject: RE: Drill CPU Usage

Both the log and out files look clean. Attached you can find the thread dump output (3 snapshots). I would appreciate it if someone could point me to the reason the drillbits constantly consume CPU. BTW, the CPU usage is now constant at 40% of a single core (double from a week ago).

-----Original Message-----
From: Dechang Gu [mailto:d...@maprtech.com]
Sent: Tuesday, August 2, 2016 22:21
To: user@drill.apache.org
Subject: Re: Drill CPU Usage

On Tue, Aug 2, 2016 at 12:49 AM, Avi Haleva <avi.hal...@aternity.com> wrote:

Hi, I'm in the process of evaluating Drill as an analytic repository. I've noticed that even when idle, the drillbit process constantly consumes 20% of a single core. Is that expected behavior, or have I misconfigured something?

I don't think that is expected. I checked my system, and the drillbit CPU usage stays at 0% almost constantly when idle. I did occasionally see a glitch at 20-30% CPU usage, due to zookeeper heartbeat traffic, I guess. Can you collect a jstack on the drillbit process to find out what is running? Also check drillbit.out and drillbit.log (typically under /var/log/drill/) to see if there is anything suspicious. HTH, and thanks, Dechang

I'm running drill on a single node with zookeeper in non-embedded mode (as I want to have more than a single concurrent connection). Thanks, Avi

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is." -- Isaac Asimov.
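To make the attached thread dumps easier to reason about, a small script can summarize them instead of eyeballing every thread. This is a sketch I put together (the dump format assumed is standard HotSpot `jstack` output, captured with `jstack <drillbit-pid>`; the helper names are mine, not part of any Drill tooling): it counts threads by state, so a pool that is constantly RUNNABLE on an "idle" drillbit stands out immediately.

```python
# Sketch: summarize a jstack-style thread dump by thread state.
# VM/GC threads print no "java.lang.Thread.State:" line, so runnable
# native threads are detected from the header line instead.
import re
from collections import Counter

def summarize(dump: str) -> Counter:
    """Count thread states in a HotSpot jstack dump."""
    states = Counter()
    for line in dump.splitlines():
        m = re.search(r'java\.lang\.Thread\.State: (\w+)', line)
        if m:
            states[m.group(1)] += 1
        elif re.match(r'^".*" .*os_prio=', line) and line.rstrip().endswith("runnable"):
            states["RUNNABLE (native)"] += 1  # GC/VM threads, no State line
    return states

sample = '''\
"GC task thread#0 (ParallelGC)" os_prio=0 tid=0x7f4c4c025800 nid=0x1dbcc runnable
"main" #1 prio=5 os_prio=0 tid=0x00007f00 nid=0x1 waiting on condition
   java.lang.Thread.State: WAITING (parking)
'''
print(summarize(sample))
```

Feeding the three snapshots through this and diffing the counts would show whether the 40% CPU comes from GC threads, the RPC event loop, or something else.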
Re: Dockerized Drill with Bridged Network
You can try this image if you want to do a quick test: https://hub.docker.com/r/miradatv/apache-drill/

It has a few patches, though.

On Tue, Aug 09, 2016 at 05:36:56PM +0300, Oscar Morante wrote:

Hi Suhas, I've been running a Drill cluster on Kubernetes just by patching this: https://github.com/miradatv/drill/commit/ef139a4d8645936a0ecbc8318546fad99ee9609b

I think it should also work on Mesos. Cheers,

On Mon, Aug 08, 2016 at 12:55:32PM -0700, Suhas Gaddam wrote:

Hi, I am trying to deploy a dockerized Drill in a Mesos cluster with bridge networking and in distributed mode. I can successfully run a single instance of Drill, but when I try to scale to 2 instances Drill breaks with:

java.io.IOException: Failed to create DrillClient: CONNECTION : java.nio.channels.UnresolvedAddressException

From my first pass at debugging, I think this is related to the host name that Drill publishes to zookeeper. When run in a bridged docker container, Drill seems to put the hostname from inside the docker container in the zookeeper node, which has no meaning outside of that instance. Is there a way to specify the hostname Drill publishes?

https://github.com/vadio/docker-apache-drill

Thanks, Suhas
Re: Dockerized Drill with Bridged Network
Hi Suhas, I've been running a Drill cluster on Kubernetes just by patching this: https://github.com/miradatv/drill/commit/ef139a4d8645936a0ecbc8318546fad99ee9609b

I think it should also work on Mesos. Cheers,

On Mon, Aug 08, 2016 at 12:55:32PM -0700, Suhas Gaddam wrote:

Hi, I am trying to deploy a dockerized Drill in a Mesos cluster with bridge networking and in distributed mode. I can successfully run a single instance of Drill, but when I try to scale to 2 instances Drill breaks with:

java.io.IOException: Failed to create DrillClient: CONNECTION : java.nio.channels.UnresolvedAddressException

From my first pass at debugging, I think this is related to the host name that Drill publishes to zookeeper. When run in a bridged docker container, Drill seems to put the hostname from inside the docker container in the zookeeper node, which has no meaning outside of that instance. Is there a way to specify the hostname Drill publishes?

https://github.com/vadio/docker-apache-drill

Thanks, Suhas
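For anyone hitting the same hostname problem later: as far as I can tell, Drill's start-up scripts read a `DRILL_HOST_NAME` environment variable and use it as the address the drillbit registers in ZooKeeper (treat both the variable name and this behavior as an assumption to verify against `drill-config.sh` in your Drill version). A minimal sketch of a `drill-env.sh` fragment for a bridged container:

```shell
# Hypothetical drill-env.sh fragment -- verify your Drill version's
# start scripts honor DRILL_HOST_NAME before relying on this.
# HOST_ADDR is a placeholder injected by the scheduler (Mesos/Kubernetes)
# holding the externally reachable address of the container's host.
export DRILL_HOST_NAME="${HOST_ADDR:-$(hostname -f)}"
```

With bridged networking you would also need the drillbit ports (31010-31012 by default) mapped one-to-one, since other drillbits and clients connect to the advertised address on those same port numbers.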
Query hangs on planning
Hi there, I've been stuck with this for a while, and I'm not sure if I'm running into a bug or just doing something very wrong. I have this stripped-down version of my query: https://gist.github.com/spacepluk/9ab1e1a0cfec6f0efb298f023f4c805b

The data is just a single file with one record (1.5K). Without changing anything, explain takes ~1sec on my machine. But if I uncomment the where clause, then it runs for a couple of hours until it runs out of memory. Also, if I uncomment the where clause *and* take out the join, then it takes around 30s to plan.

Any ideas? Thanks!
Re: Partition pruning using CURRENT_DATE?
Great, thanks! I'm gonna try swapping the view file and see how it goes.

On Thu, Jul 21, 2016 at 11:09:45AM -0500, John Omernik wrote:

Yes, I have a view that has the hard-coded date in it. It wasn't difficult, and using the REST API was actually fairly neat/clean. I agree with you that it would be nice, but this worked pretty well for me too. (I also wonder if you could just change the raw view definition file and update the date.) John

On Thu, Jul 21, 2016 at 2:26 AM, Oscar Morante <spacep...@gmail.com> wrote:

Hi John, I've been following your trail of emails :) Thanks for sharing all that info; it's very useful.

I think I'm trying to do something very similar to what you did. I have data flowing from Storm into S3, and I wanted to be able to periodically preprocess/repartition it into a new folder, and then have views that merge recent data from the raw Storm files with old data from the preprocessed/repartitioned folders. These views are intended to be used from Tableau.

I guess I can create a small process that checks when a new folder with preprocessed data is available and replaces the appropriate view files with new versions that have the proper date string. But it would be a lot nicer to just do it in the view and have a dumb process executing the periodic queries. How did you "solve" it in the end, if I can ask?

Thanks, Oscar

On Wed, Jul 20, 2016 at 07:26:20AM -0500, John Omernik wrote:

I think I ran into that issue before, and (someone will correct me if I am wrong) the issue is that current_date is only materialized AFTER planning. Thus the pruning, which occurs during planning, doesn't happen. Is this a programmatic query or just something that is being done for users? I know my issue was that I wanted a view that showed only the current date, and I struggled to come up with a good solution to that. John

On Wed, Jul 20, 2016 at 6:06 AM, Oscar Morante <spacep...@gmail.com> wrote:

I'm trying to trigger partition pruning like this:

select * from dfs.`json/by-date` where dir0 = cast(current_date as varchar);

But apparently, it only works when passing a literal. Am I missing something?

Thanks,

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is." -- Isaac Asimov.
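The periodic view-rewriting process John describes could be sketched roughly like this. The view and table names are placeholders, and while the REST endpoint shape (POST to `/query.json` with a `{"queryType": "SQL", "query": ...}` body) matches Drill's documented REST API, you should verify it against your version:

```python
# Sketch: regenerate a view with today's date baked in as a literal,
# so partition pruning can happen at plan time.  View/table names and
# the drillbit host are hypothetical placeholders.
import datetime
import json
import urllib.request

def build_view_ddl(today: datetime.date) -> str:
    """Render the CREATE VIEW statement with a literal date string."""
    return (
        "CREATE OR REPLACE VIEW dfs.tmp.`recent_data` AS "
        "SELECT * FROM dfs.`json/by-date` "
        f"WHERE dir0 = '{today.isoformat()}'"
    )

def submit(drillbit: str, sql: str) -> None:
    """POST the statement to Drill's REST API (assumed reachable on 8047)."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode()
    req = urllib.request.Request(
        f"http://{drillbit}:8047/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    ddl = build_view_ddl(datetime.date.today())
    print(ddl)  # run from cron; call submit(host, ddl) to apply it
```

Run daily from cron, the view always contains a plain literal, so the planner can prune partitions even though the "logical" intent is current_date.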
Re: Partition pruning using CURRENT_DATE?
Hi John, I've been following your trail of emails :) Thanks for sharing all that info; it's very useful.

I think I'm trying to do something very similar to what you did. I have data flowing from Storm into S3, and I wanted to be able to periodically preprocess/repartition it into a new folder, and then have views that merge recent data from the raw Storm files with old data from the preprocessed/repartitioned folders. These views are intended to be used from Tableau.

I guess I can create a small process that checks when a new folder with preprocessed data is available and replaces the appropriate view files with new versions that have the proper date string. But it would be a lot nicer to just do it in the view and have a dumb process executing the periodic queries. How did you "solve" it in the end, if I can ask?

Thanks, Oscar

On Wed, Jul 20, 2016 at 07:26:20AM -0500, John Omernik wrote:

I think I ran into that issue before, and (someone will correct me if I am wrong) the issue is that current_date is only materialized AFTER planning. Thus the pruning, which occurs during planning, doesn't happen. Is this a programmatic query or just something that is being done for users? I know my issue was that I wanted a view that showed only the current date, and I struggled to come up with a good solution to that. John

On Wed, Jul 20, 2016 at 6:06 AM, Oscar Morante <spacep...@gmail.com> wrote:

I'm trying to trigger partition pruning like this:

select * from dfs.`json/by-date` where dir0 = cast(current_date as varchar);

But apparently, it only works when passing a literal. Am I missing something?

Thanks,
Partition pruning using CURRENT_DATE?
I'm trying to trigger partition pruning like this:

select * from dfs.`json/by-date` where dir0 = cast(current_date as varchar);

But apparently, it only works when passing a literal. Am I missing something?

Thanks,

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is." -- Isaac Asimov.
Re: S3 query takes a long time to start
Hi Jacques, all these technologies are pretty new to me, but I can give it a shot :) Is there any literature that can help me understand how things are set up?

Cheers,

On Fri, Mar 11, 2016 at 01:42:54PM -0800, Jacques Nadeau wrote:

I've been thinking a lot about this. I definitely think there should be a clean fix, but I haven't had the cycles to suggest something. Are you up for looking at the code and trying to suggest something? Thanks!

-- Jacques Nadeau, CTO and Co-Founder, Dremio

On Thu, Mar 10, 2016 at 8:06 AM, Oscar Morante <spacep...@gmail.com> wrote:

I've been checking the logs, and I think that the problem is that it's walking through the "directories" in S3 recursively, doing lots of small HTTP requests. My files are organized like this, which amplifies the issue:

/category/random-hash/year/month/day/hour/data-chunk-000.json.gz

The random hash is there to trick S3 into using a different partition/shard for each put [1]. But it looks like this structure is clashing with the way Drill/hadoop.fs.s3a gets the list of files.

I think that it should be possible to get the complete list of files under a given "directory" (e.g. `/category`) with just one HTTP query, but I don't know how hard it would be to incorporate that behavior. Any ideas? How are you organizing your S3 files to get good performance?

Thanks!

[1]: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html

On Thu, Mar 10, 2016 at 12:27:42PM +0200, Oscar Morante wrote:

I'm querying 20G of gzipped JSONs split into ~5600 small files, with sizes ranging from 1Mb to 30Mb. Drill is running in AWS on 4 m4.xlarge nodes, and it's taking around 50 minutes before the query starts executing. Any idea what could be causing this delay? What's the best way to debug this?

Thanks,

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is." -- Isaac Asimov.
Re: S3 query takes a long time to start
I've been checking the logs, and I think that the problem is that it's walking through the "directories" in S3 recursively, doing lots of small HTTP requests. My files are organized like this, which amplifies the issue:

/category/random-hash/year/month/day/hour/data-chunk-000.json.gz

The random hash is there to trick S3 into using a different partition/shard for each put [1]. But it looks like this structure is clashing with the way Drill/hadoop.fs.s3a gets the list of files.

I think that it should be possible to get the complete list of files under a given "directory" (e.g. `/category`) with just one HTTP query, but I don't know how hard it would be to incorporate that behavior. Any ideas? How are you organizing your S3 files to get good performance?

Thanks!

[1]: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html

On Thu, Mar 10, 2016 at 12:27:42PM +0200, Oscar Morante wrote:

I'm querying 20G of gzipped JSONs split into ~5600 small files, with sizes ranging from 1Mb to 30Mb. Drill is running in AWS on 4 m4.xlarge nodes, and it's taking around 50 minutes before the query starts executing. Any idea what could be causing this delay? What's the best way to debug this?

Thanks,

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is." -- Isaac Asimov.
S3 query takes a long time to start
I'm querying 20G of gzipped JSONs split into ~5600 small files, with sizes ranging from 1Mb to 30Mb. Drill is running in AWS on 4 m4.xlarge nodes, and it's taking around 50 minutes before the query starts executing. Any idea what could be causing this delay? What's the best way to debug this?

Thanks,

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is." -- Isaac Asimov.
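The request-amplification effect described in this thread can be illustrated with a small simulation. S3 has no real directories: one LIST request with a prefix and no delimiter can return every key under `/category` in pages of up to 1000, whereas walking pseudo-directories with `delimiter='/'` costs one request per level per "directory". The fake client below just counts requests (with boto3, the flat listing would be roughly `s3.get_paginator('list_objects_v2').paginate(Bucket=b, Prefix=p)`; treat that call shape as the standard API, but the walker here is a simplified model of the recursive behavior, not Drill's actual code):

```python
# Sketch: compare a recursive "directory" walk against one flat
# prefix listing over the same fake S3 key space.
class FakeS3:
    def __init__(self, keys):
        self.keys = sorted(keys)
        self.requests = 0

    def list(self, prefix, delimiter=None):
        self.requests += 1
        under = [k for k in self.keys if k.startswith(prefix)]
        if delimiter is None:
            return under, []          # flat: every key in one request
        files, dirs = [], set()
        for k in under:
            rest = k[len(prefix):]
            if delimiter in rest:     # deeper keys become pseudo-dirs
                dirs.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
            else:
                files.append(k)
        return files, sorted(dirs)

def walk(s3, prefix):
    """Recursive walk: one LIST per pseudo-directory (the slow path)."""
    files, dirs = s3.list(prefix, delimiter="/")
    for d in dirs:
        files += walk(s3, d)
    return files

# Two random-hash shards, three days each, mirroring the layout above.
keys = [
    f"category/{h}/2016/03/{day:02d}/10/data-chunk-000.json.gz"
    for h in ("a1", "b2") for day in range(1, 4)
]

s3 = FakeS3(keys)
walked = walk(s3, "category/")
walk_requests = s3.requests           # one request per pseudo-dir level

s3 = FakeS3(keys)
flat, _ = s3.list("category/")        # a single request finds everything
print(walk_requests, "requests recursive vs.", s3.requests, "flat")
```

With only 6 objects the walk already issues 19 requests versus 1; at ~5600 files spread over hash/year/month/day/hour levels, the per-request latency adds up to exactly the kind of multi-minute planning delay reported here.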
Time intervals overflowing?
I was trying to generate a sequence of timestamps from a sequence of numbers and ran into this:

select 596 * interval '1' hour, 597 * interval '1' hour from (values 1);

EXPR$0     | EXPR$1
-----------+-----------------
PT2145600S | PT-2145767.296S

Notice that the second expression turns into a negative interval. Is that the expected behavior?

Cheers,

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is." -- Isaac Asimov.
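The reported values are exactly what a signed 32-bit millisecond counter would produce (that the interval is stored that way internally is an assumption, but the arithmetic matches to the last digit): 596 hours is 2,145,600,000 ms, which still fits under 2^31 - 1 = 2,147,483,647, while 597 hours is 2,149,200,000 ms and wraps negative.

```python
# Sketch: reproduce the reported interval values assuming the
# day-to-second interval is held as a signed 32-bit millisecond count
# (an assumption about Drill's internals; the math below only shows
# the output is consistent with such an overflow).

def wrap_int32(n: int) -> int:
    """Wrap an integer into signed 32-bit range, like Java int overflow."""
    return (n + 2**31) % 2**32 - 2**31

ms_596h = 596 * 3600 * 1000           # 2,145,600,000 -> fits in int32
ms_597h = 597 * 3600 * 1000           # 2,149,200,000 -> overflows

print(wrap_int32(ms_596h) / 1000)     # 2145600.0    -> PT2145600S
print(wrap_int32(ms_597h) / 1000)     # -2145767.296 -> PT-2145767.296S
```

So 596 hours is the largest whole-hour interval that survives; anything from 597 hours up wraps into negative territory.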
Re: Need some help debugging slow S3 download speed
You were right, and the time is actually spent decompressing the files; I must have messed up my initial profiling on that. And I'm already using the native libraries, so I guess that's as good as it gets.

After checking that an uncompressed file downloads at full speed, I did some more tests, and it turns out that the query on the gzipped file is faster even though the download rate is lower! So at least I learned a good lesson about making wrong assumptions, hehe.

Anyways, I'm still wondering if there are any other places I can squeeze some extra performance from, and I can only think of:

- Maybe using snappy to decompress? I tried to find where the decompression takes place but I couldn't find it.
- Using a different compression algorithm. Any recommendations based on experience?

Thanks!

On Sun, Feb 21, 2016 at 04:58:17PM -0800, Jacques Nadeau wrote:

The gzipped question is a good one. I believe you need to add extra native libraries to get reasonable performance when using gzip files. See if you are seeing this in your logs or out:

Unable to load native-hadoop library for your platform

If so, you probably need to get them set up per here: https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/NativeLibraries.html

On Sun, Feb 21, 2016 at 11:53 AM, Oscar Morante <spacep...@gmail.com> wrote:

I'm still fighting with this. So far I've tried:

- Adding the native hadoop libraries.
- `s3n://` and `s3a://`.
- SSL on/off.
- Different S3 endpoints.
- Different values in `drill.exec.buffer.size`.

None of these seem to make a difference, and `s3cmd` is always ten times faster than Drill at downloading the same file. Netstat shows about 800k piling up in Recv-Q during the query, while `s3cmd` is pretty much clean the whole time. I've also noticed that if I cancel the query in the middle, the download speed suddenly goes up and matches s3cmd for a while before it stops.

Is there anything else that I can try to improve the situation? At the beginning I thought that S3 was the bottleneck, but everything is pointing to some kind of lock in Drill. Or maybe I'm just being unrealistic and asking too much :?

Cheers,

On Fri, Feb 19, 2016 at 02:27:56PM +0200, Oscar Morante wrote:

Hi there, I'm experiencing very slow download rates from S3, but only when using Drill. This is testing with only one drillbit and querying a 250Mb gzipped JSON:

select count(somefield) from s3.`test/big.json.gz`;

The download speed while Drill is executing the query is about 5Mb/s. Then if I try downloading the same file from the same environment using `s3cmd`, the average speed is about 60Mb/s.

Any idea what could be causing such a big difference? I'm not sure what's the best way to debug this, or what the relevant configuration parameters are that I should be tweaking.

Thanks!

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is." -- Isaac Asimov.
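A quick way to sanity-check the conclusion reached in this thread (decompression, not the network, caps throughput) is to time gzip decompression in isolation. Gzip is an inherently single-stream format, so one `.json.gz` file can only be decompressed, and therefore scanned, by one thread; splitting the data into many smaller files, or using a faster or splittable codec, is the usual workaround. A minimal sketch using only the standard library:

```python
# Sketch: measure single-core gzip decompression throughput to compare
# against the observed 5Mb/s download rate.  The payload is synthetic
# JSON-ish data, roughly 12 MB uncompressed.
import gzip
import os
import time

line = (b'{"somefield": 1, "filler": "'
        + os.urandom(16).hex().encode() + b'"}\n')
payload = line * 200_000
blob = gzip.compress(payload)

start = time.perf_counter()
out = gzip.decompress(blob)
elapsed = time.perf_counter() - start

mb = len(payload) / 1e6
print(f"decompressed {mb:.0f} MB in {elapsed:.3f}s "
      f"({mb / elapsed:.0f} MB/s on one core)")
```

If the measured single-core rate is close to what Drill achieves end to end, the scan is decompression-bound and a different codec (or pre-split files, so multiple fragments can decompress in parallel) is where the remaining headroom is.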
Re: Need some help debugging slow S3 download speed
I'm still fighting with this. So far I've tried:

- Adding the native hadoop libraries.
- `s3n://` and `s3a://`.
- SSL on/off.
- Different S3 endpoints.
- Different values in `drill.exec.buffer.size`.

None of these seem to make a difference, and `s3cmd` is always ten times faster than Drill at downloading the same file. Netstat shows about 800k piling up in Recv-Q during the query, while `s3cmd` is pretty much clean the whole time. I've also noticed that if I cancel the query in the middle, the download speed suddenly goes up and matches s3cmd for a while before it stops.

Is there anything else that I can try to improve the situation? At the beginning I thought that S3 was the bottleneck, but everything is pointing to some kind of lock in Drill. Or maybe I'm just being unrealistic and asking too much :?

Cheers,

On Fri, Feb 19, 2016 at 02:27:56PM +0200, Oscar Morante wrote:

Hi there, I'm experiencing very slow download rates from S3, but only when using Drill. This is testing with only one drillbit and querying a 250Mb gzipped JSON:

select count(somefield) from s3.`test/big.json.gz`;

The download speed while Drill is executing the query is about 5Mb/s. Then if I try downloading the same file from the same environment using `s3cmd`, the average speed is about 60Mb/s.

Any idea what could be causing such a big difference? I'm not sure what's the best way to debug this, or what the relevant configuration parameters are that I should be tweaking.

Thanks!

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is." -- Isaac Asimov.
Need some help debugging slow S3 download speed
Hi there, I'm experiencing very slow download rates from S3, but only when using Drill. This is testing with only one drillbit and querying a 250Mb gzipped JSON:

select count(somefield) from s3.`test/big.json.gz`;

The download speed while Drill is executing the query is about 5Mb/s. Then if I try downloading the same file from the same environment using `s3cmd`, the average speed is about 60Mb/s.

Any idea what could be causing such a big difference? I'm not sure what's the best way to debug this, or what the relevant configuration parameters are that I should be tweaking.

Thanks!

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is." -- Isaac Asimov.
Re: 'DATE - DATE' not supported?
Thanks for the explanation. I'll just modify the queries in the meantime :)

On Tue, Feb 09, 2016 at 11:52:12AM -0800, Hsuan Yi Chu wrote:

The reason you are seeing this is that the types are known to Drill at planning time. The planning engine Drill uses is Calcite, which does not allow this operation to happen (as the error message suggests). One workaround you could take is to cover up the type information by casting to ANY, such as:

SELECT cast(DATE '2009-01-01' as any) - cast(DATE '2009-05-05' as any) FROM (values(1));

On the other hand, the community is doing some work to relax the type constraints (a component of DRILL-4372). Hopefully, you will be able to run your original query in the near future.

On Tue, Feb 9, 2016 at 6:47 AM, Oscar Morante <spacep...@gmail.com> wrote:

Hi there, I'm new to both the list and Drill. I'm doing some tests using Tableau, and when I use the `DATEDIFF` function in a calculated field, Tableau generates queries that contain stuff like this:

SELECT DATE '2009-01-01' - DATE '2009-05-05' FROM (values(1));

...which throws this error:

ExampleExceptionFormatter: exception message was: VALIDATION ERROR: From line 1, column 8 to line 1, column 44: Cannot apply '-' to arguments of type '<DATE> - <DATE>'. Supported form(s): '<NUMERIC> - <NUMERIC>' '<DATETIME_INTERVAL> - <DATETIME_INTERVAL>' '<DATETIME> - <DATETIME_INTERVAL>'

I was looking around, and according to the original task (DRILL-549) that kind of query should be supported. Is this a regression? Am I doing something wrong?

Thanks,

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is." -- Isaac Asimov.
'DATE - DATE' not supported?
Hi there, I'm new to both the list and Drill. I'm doing some tests using Tableau, and when I use the `DATEDIFF` function in a calculated field, Tableau generates queries that contain stuff like this:

SELECT DATE '2009-01-01' - DATE '2009-05-05' FROM (values(1));

...which throws this error:

ExampleExceptionFormatter: exception message was: VALIDATION ERROR: From line 1, column 8 to line 1, column 44: Cannot apply '-' to arguments of type '<DATE> - <DATE>'. Supported form(s): '<NUMERIC> - <NUMERIC>' '<DATETIME_INTERVAL> - <DATETIME_INTERVAL>' '<DATETIME> - <DATETIME_INTERVAL>'

I was looking around, and according to the original task (DRILL-549) that kind of query should be supported. Is this a regression? Am I doing something wrong?

Thanks,

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is." -- Isaac Asimov.
Source for drill's calcite?
I'm trying to add support for week and weekdays to `date_trunc` and `date_part`. It seems to be working fine right now, except that I need to patch Calcite's TimeUnit so that the parser doesn't complain when using `extract` directly.

I first tried using the latest Calcite, but it's too different from the version Drill is using, and `1.4.0-incubation` doesn't work either.

I've tried to look for the source, but I can't seem to find `1.4.0-drill-r10` anywhere. Any ideas?

PS: Here's the patch so far -> https://gist.github.com/spacepluk/40df5a90ddee2efe1f4a

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is." -- Isaac Asimov.
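For readers wondering what semantics the patch above is after, truncating a date to "week" conventionally means rolling it back to the Monday of its ISO week. A small sketch of that behavior (this only illustrates the intended `date_trunc('week', ...)` result; the actual Drill/Calcite change lives in the linked gist):

```python
# Sketch of week-truncation semantics: roll a date back to the Monday
# of its ISO week.  Purely illustrative; not the Drill implementation.
import datetime

def trunc_to_week(d: datetime.date) -> datetime.date:
    """Return the Monday that starts d's ISO week."""
    return d - datetime.timedelta(days=d.weekday())

print(trunc_to_week(datetime.date(2016, 2, 9)))  # a Tuesday -> 2016-02-08
```

Mondays map to themselves, so the function is idempotent, which is the property `date_trunc` needs for grouping timestamps into week buckets.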
Re: Source for drill's calcite?
Thanks! :)

On Tue, Feb 09, 2016 at 08:30:44AM -0800, Abdel Hakim Deneche wrote:

You can find the r10 branch here: https://github.com/mapr/incubator-calcite/tree/DrillCalcite1.4.0

On Tue, Feb 9, 2016 at 8:00 AM, Jason Altekruse <altekruseja...@gmail.com> wrote:

I can't find the latest version either, but this is the r9 branch. I don't think any very major changes happened in the last update (it's likely just back-porting a fix from calcite master), so you can base your work on this branch and rebase it when someone points you to the updated branch. https://github.com/dremio/calcite/commits/1.4.0-drill-r9

On Tue, Feb 9, 2016 at 7:46 AM, Oscar Morante <spacep...@gmail.com> wrote:
> I'm trying to add support for week and weekdays to `date_trunc` and
> `date_part`. It seems to be working fine right now except that I need to
> patch calcite's TimeUnit so that the parser doesn't complain when using
> `extract` directly.
>
> I first tried using the latest calcite but it's too different from the
> version Drill is using, and `1.4.0-incubation` doesn't work either.
>
> I've tried to look for the source but I can't seem to find
> `1.4.0-drill-r10` anywhere.
>
> Any ideas?
>
> PS: Here's the patch so far ->
> https://gist.github.com/spacepluk/40df5a90ddee2efe1f4a

--
Oscar Morante
"Self-education is, I firmly believe, the only kind of education there is." -- Isaac Asimov.