Re: [Drill-Questions] Speed difference between GZ and BZ2
What is the data format within those .gz and .bz2 files? Is it Parquet, JSON, or plain text (CSV)?
Also, what was the config parameter `store.parquet.compression` set to when you ran your test?

- Khurram

On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane wrote:

> Awaiting a response.
>
> On 30-Jul-2016 3:20 PM, "Shankar Mane" wrote:
> >
> > I am comparing query speed between GZ and BZ2.
> >
> > Below are the 2 files and their sizes (both files contain the same data):
> > kafka_3_25-Jul-2016-12a.json.gz  = 1.8G
> > kafka_3_25-Jul-2016-12a.json.bz2 = 1.1G
> >
> > Results:
> >
> > 0: jdbc:drill:> select channelid, count(serverTime) from dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid ;
> > +------------+----------+
> > | channelid  | EXPR$1   |
> > +------------+----------+
> > | 3          | 977134   |
> > | 0          | 836850   |
> > | 2          | 3202854  |
> > +------------+----------+
> > 3 rows selected (86.034 seconds)
> >
> > 0: jdbc:drill:> select channelid, count(serverTime) from dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by channelid ;
> > +------------+----------+
> > | channelid  | EXPR$1   |
> > +------------+----------+
> > | 3          | 977134   |
> > | 0          | 836850   |
> > | 2          | 3202854  |
> > +------------+----------+
> > 3 rows selected (459.079 seconds)
> >
> > Questions:
> > 1. In the test above, GZ is about 5x faster than BZ2 (86 s vs. 459 s). Why is that?
> > 2. How can we speed up BZ2? Is there any configuration for this?
> > 3. Since bz2 is a splittable format, how does Drill make use of that?
> >
> > regards,
> > shankar
Re: [Drill-Questions] Speed difference between GZ and BZ2
It is plain JSON (one JSON message per line).
Size of each JSON message = ~4 KB
Number of JSON messages = ~5 million

store.parquet.compression = snappy (I don't think this parameter is used, as I am only running SELECT queries.)

On Mon, Aug 1, 2016 at 3:27 PM, Khurram Faraaz wrote:

> What is the data format within those .gz and .bz2 files? Is it Parquet,
> JSON, or plain text (CSV)?
> Also, what was the config parameter `store.parquet.compression` set to
> when you ran your test?
>
> - Khurram
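One way to check whether the gap comes from the codecs themselves, rather than from Drill, is to time both outside the query engine. The sketch below uses only Python's standard `gzip` and `bz2` modules on synthetic line-delimited JSON. The record shape (`channelid`, `serverTime`, `payload`) and the record count are assumptions made for illustration, not the actual Kafka data, and absolute timings will vary by machine.

```python
import bz2
import gzip
import json
import time


def make_records(n):
    """Build n lines of synthetic line-delimited JSON (hypothetical schema)."""
    lines = []
    for i in range(n):
        record = {"channelid": i % 4, "serverTime": 1469404800 + i, "payload": "x" * 200}
        lines.append(json.dumps(record))
    return ("\n".join(lines) + "\n").encode("utf-8")


def time_decompress(decompress, blob):
    """Return (decompressed bytes, elapsed seconds) for one decompression call."""
    start = time.perf_counter()
    data = decompress(blob)
    return data, time.perf_counter() - start


raw = make_records(20000)
gz_blob = gzip.compress(raw)
bz2_blob = bz2.compress(raw)

gz_data, gz_seconds = time_decompress(gzip.decompress, gz_blob)
bz2_data, bz2_seconds = time_decompress(bz2.decompress, bz2_blob)

print(f"gzip : {len(gz_blob)} bytes compressed, {gz_seconds:.4f} s to decompress")
print(f"bzip2: {len(bz2_blob)} bytes compressed, {bz2_seconds:.4f} s to decompress")
```

bzip2 generally trades slower decompression for a smaller file, so if the standalone timings show a ratio similar to the query times, much of the difference is likely codec cost rather than a Drill setting.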
Re: deploy dockerized drill cluster
I think part of the problem here is that the JSON options are stored in ZooKeeper. Are you running the ZooKeepers in the Docker containers, in separate Docker containers, or completely separately?

For those settings, I'd bootstrap the cluster by itself, get it up and running, and then use the ALTER SYSTEM SET command, either from SQLLine or through the REST API. The settings will then persist in the ZooKeeper information.

I know I asked a similar question, because I wanted new clusters to be set up with specified admin users (i.e., I wanted to provide my list of admin users before the first drillbit started, and there wasn't an easy way to do that at the time I asked).

On Sun, Jul 31, 2016 at 8:07 PM, Jesse Yates wrote:

> Ran into this myself today, so gonna try and close the loop.
>
> Looking at the 1.6 codeline (assuming it's the same in 1.7), on startup the
> SystemOptionManager pulls out the sys.options line from the config and
> uses that to build the PersistentStoreConfig (which gets initialized from
> the store in #init()). However, this obviously doesn't take into account
> the config-level options, only the startup options, since those get pulled
> out to create everything else.
>
> It seems simple enough to wrap the SystemOptionManager in a
> FallbackOptionManager that falls back to the config file if the option is
> not in the persistent store (or something like that).
>
> Unless someone already filed a JIRA on this?
>
> On Mon, Jul 25, 2016 at 10:05 AM Scott Kinney wrote:
>
> > It's still not picking up the store.json* config changes.
> >
> > The only way I can see to set these is by running an ALTER SYSTEM query
> > after the Drill API is up.
> >
> > Scott Kinney | DevOps
> > stem | m 510.282.1299
> > 100 Rollins Road, Millbrae, California 94030
> >
> > This e-mail and/or any attachments contain Stem, Inc. confidential and
> > proprietary information and material for the sole use of the intended
> > recipient(s). Any review, use or distribution that has not been expressly
> > authorized by Stem, Inc. is strictly prohibited. If you are not the
> > intended recipient, please contact the sender and delete all copies.
> > Thank you.
> >
> > From: John Omernik
> > Sent: Monday, July 25, 2016 8:21 AM
> > To: user
> > Subject: Re: deploy dockerized drill cluster
> >
> > Try (for the sake of the conversation here) using host networking, and
> > see if it changes how successful your setup is. (I know bridged is
> > preferred, but try the host side and see what happens.)
> >
> > John
> >
> > On Mon, Jul 25, 2016 at 10:06 AM, Scott Kinney wrote:
> >
> > > I'm running Docker in bridged network mode.
> > >
> > > From: John Omernik
> > > Sent: Sunday, July 24, 2016 8:28 AM
> > > To: user
> > > Subject: Re: deploy dockerized drill cluster
> > >
> > > Are you running Drill in host networking or bridged networking?
> > >
> > > On Sat, Jul 23, 2016 at 1:21 PM, Scott Kinney wrote:
> > >
> > > > Hm, I must have set those another way in embedded mode. I can't see
> > > > where. Those settings persist between Drill restarts.
> > > >
> > > > From: Abhishek Girish
> > > > Sent: Friday, July 22, 2016 1:57 PM
> > > > To: Drill User List
> > > > Subject: Re: deploy dockerized drill cluster
> > > >
> > > > You can set boot-level start-up options in drill-override.conf [1].
> > > > But I don't think we can do the same with the system options. Someone
> > > > else can comment if there is a workaround.
> > > >
> > > > Why it works for you with drill-embedded is something I'm trying to
> > > > understand. I attempted
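The "bootstrap first, then ALTER SYSTEM over REST" approach suggested at the top of this message can be sketched as follows. This is a minimal sketch, assuming the drillbit's web interface listens on the default port 8047 and accepts SQL at `/query.json`. The host name is hypothetical, and `store.json.all_text_mode` is used only as an example of a `store.json*` option. The request is built but not sent, so the snippet runs without a cluster.

```python
import json
import urllib.request


def build_alter_system_request(host, option, value, port=8047):
    """Build (but do not send) a POST to Drill's /query.json endpoint."""
    # Render the Python value as a SQL literal.
    if isinstance(value, bool):
        literal = "true" if value else "false"
    elif isinstance(value, (int, float)):
        literal = str(value)
    else:
        literal = "'" + str(value).replace("'", "''") + "'"
    sql = f"ALTER SYSTEM SET `{option}` = {literal}"
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host name; the option is only an example.
request = build_alter_system_request("drillbit-1.example.com", "store.json.all_text_mode", True)
print(request.full_url)
print(request.data.decode("utf-8"))

# To apply the option against a running drillbit:
# urllib.request.urlopen(request)
```

A container start-up script could run something like this once the first drillbit answers on its web port, since the option then persists in ZooKeeper as described above.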
Re: concurrent get connection in different node
You may try connecting to a different drillbit for each query, say in a round-robin fashion. See reference:
https://drill.apache.org/docs/using-the-jdbc-driver/#using-the-jdbc-url-format-for-a-direct-drillbit-connection

On Fri, Jul 29, 2016 at 2:27 AM, qiang li wrote:

> We are running queries concurrently and getting connections through JDBC.
>
> We found that the queries are not distributed equally across the cluster.
> That is, some nodes get more queries while others get fewer.
>
> This causes queries to run slowly on the busy nodes.
>
> Is there any way to distribute the queries evenly?
> For example, I have a cluster with 16 nodes. If I run 16 queries
> concurrently, the ideal result is that each node has one query running.
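The round-robin idea can be sketched on the client side as below. This is a minimal sketch, assuming the direct-connection URL form `jdbc:drill:drillbit=<host>:<port>` from the linked page and the default user port 31010. The host names are hypothetical, and the snippet only generates the URLs. Opening the connections is left to whatever JDBC client is in use.

```python
from itertools import cycle


def direct_jdbc_urls(drillbits, port=31010):
    """Yield direct-connection JDBC URLs, cycling through the drillbits forever."""
    for host in cycle(drillbits):
        yield f"jdbc:drill:drillbit={host}:{port}"


# Hypothetical host names for a 16-node cluster.
nodes = [f"drill-node-{i:02d}" for i in range(1, 17)]
urls = direct_jdbc_urls(nodes)

# Pick the URL for each of 16 concurrent queries.
assigned = [next(urls) for _ in range(16)]
for url in assigned:
    print(url)
```

With 16 nodes and 16 queries, each node is used as the entry point exactly once. The 17th query would wrap around to the first node.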
Re: Connecting Drill to Azure Data Lake
What failure(s) do you see?

Thank you,
Sudheesh

> On Jul 29, 2016, at 4:07 PM, Kevin Verhoeven wrote:
>
> Hi Drill Community,
>
> Has anyone attempted to connect Drill to the Azure Data Lake? Microsoft has
> implemented a WebHDFS API over Azure Data Lake, so Drill should be able to
> connect. I'm guessing this will be similar to S3. My initial attempts have
> failed. Does anyone have any ideas or experience with this connection?
>
> Regards,
>
> Kevin
Re: Pushdown Capabilities with RDBMS
Hi Sudheesh and Zelaine,

Thank you both for the reply.

Unfortunately I wasn't able to recreate the scenario with a small set of tables and dummy records, but I created issue 4818 (https://issues.apache.org/jira/browse/DRILL-4818) on Jira and attached the query, the physical plan, and the JSON profile related to the case.

Again, thank you very much for the support.

Best,
Marcus

2016-07-15 17:48 GMT-03:00 Marcus Rehm :

> Hi all,
>
> I started to test Drill and I'm very excited about the possibilities.
>
> For now I'm trying to map our databases running on Oracle 11g. After trying
> some queries I realized that Drill takes longer to complete them than a
> general SQL client does. Looking at the execution plan I saw (or
> understood) that Drill is doing the join of the tables itself and is not
> pushing it down to the database.
>
> Is there any configuration required for this? How can I tell Drill to send
> the task of doing the join to Oracle?
>
> Thanks in advance.
>
> Best regards,
> Marcus Rehm
RE: Connecting Drill to Azure Data Lake
I see the following error (Drill 1.5.0):

org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: SocketTimeoutException: connect timed out

However, I am able to connect to the Azure Data Lake with curl from the server that reported the error, so it does not appear to be a standard connectivity issue.

My storage configuration is very basic:

    {
      "type": "file",
      "enabled": true,
      "connection": "swebhdfs://azuredatalakestoragename.azuredatalakestore.net",
      "workspaces": {
        "root": {
          "location": "/",
          "writable": true,
          "defaultInputFormat": null
        }
      },
      "formats": {
        "psv": { "type": "text", "extensions": [ "tbl", "psv" ], "delimiter": "|" },
        "csv": { "type": "text", "extensions": [ "csv" ], "delimiter": "," },
        "tsv": { "type": "text", "extensions": [ "tsv" ], "delimiter": "\t" },
        "txt": { "type": "text", "extensions": [ "txt" ], "delimiter": "," },
        "parquet": { "type": "parquet" },
        "json": { "type": "json" },
        "avro": { "type": "avro" }
      }
    }

-----Original Message-----
From: Sudheesh Katkam [mailto:skat...@maprtech.com]
Sent: Monday, August 1, 2016 11:03 AM
To: user@drill.apache.org
Subject: Re: Connecting Drill to Azure Data Lake

What failure(s) do you see?

Thank you,
Sudheesh