Re: [Drill-Questions] Speed difference between GZ and BZ2

2016-08-01 Thread Khurram Faraaz
What is the data format within those .gz and .bz2 files? Is it Parquet,
JSON, or plain text (CSV)?
Also, what was the config parameter `store.parquet.compression` set to
when you ran your test?

- Khurram

On Sun, Jul 31, 2016 at 11:17 PM, Shankar Mane 
wrote:

> Awaiting a response.
>
> On 30-Jul-2016 3:20 PM, "Shankar Mane"  wrote:
>
> >
>
> > I am comparing query speed between GZ and BZ2.
> >
> > Below are the two files and their sizes (both files contain the same data):
> > kafka_3_25-Jul-2016-12a.json.gz  = 1.8G
> > kafka_3_25-Jul-2016-12a.json.bz2 = 1.1G
> >
> >
> >
> > Results:
> >
> > 0: jdbc:drill:> select channelid, count(serverTime) from
> > dfs.`/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz` group by channelid;
> > +------------+----------+
> > | channelid  |  EXPR$1  |
> > +------------+----------+
> > | 3          | 977134   |
> > | 0          | 836850   |
> > | 2          | 3202854  |
> > +------------+----------+
> > 3 rows selected (86.034 seconds)
> >
> >
> >
> > 0: jdbc:drill:> select channelid, count(serverTime) from
> > dfs.`/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2` group by channelid;
> > +------------+----------+
> > | channelid  |  EXPR$1  |
> > +------------+----------+
> > | 3          | 977134   |
> > | 0          | 836850   |
> > | 2          | 3202854  |
> > +------------+----------+
> > 3 rows selected (459.079 seconds)
> >
> >
> >
> > Questions:
> > 1. As per the test above, GZ is about 5x faster than BZ2 (86 s vs. 459 s). Why is that?
> > 2. How can we speed up BZ2? Is there any configuration for this?
> > 3. As bz2 is a splittable format, how does Drill use it?
> >
> >
> > regards,
> > shankar
>


Re: [Drill-Questions] Speed difference between GZ and BZ2

2016-08-01 Thread Shankar Mane
It is plain JSON (one JSON object per line).
Each JSON message is ~4 KB.
Number of JSON messages: ~5 million.

store.parquet.compression = snappy (I don't think this parameter is
used, since I am only running SELECT queries.)
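
One way to separate codec cost from Drill's own overhead is to time plain
decompression of both files outside Drill. The following is a minimal Python
sketch, not something posted in this thread: `time_decompress` is a
hypothetical helper, and it only measures single-threaded decompression with
the standard `gzip` and `bz2` modules. bzip2 is generally much slower to
decompress than gzip, so a large gap here would suggest that the codec,
rather than Drill, accounts for much of the difference.

```python
import bz2
import gzip
import time


def time_decompress(path, opener, chunk_size=1 << 20):
    """Read a compressed file to the end.

    Returns (elapsed_seconds, decompressed_bytes). `opener` is gzip.open
    or bz2.open; the helper name and chunk size are illustrative choices.
    """
    start = time.perf_counter()
    total = 0
    with opener(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes means end of stream
                break
            total += len(chunk)
    return time.perf_counter() - start, total
```

Calling `time_decompress("/tmp/stest-gz/kafka_3_25-Jul-2016-12a.json.gz", gzip.open)`
and `time_decompress("/tmp/stest-bz2/kafka_3_25-Jul-2016-12a.json.bz2", bz2.open)`
on the Drill host and comparing the two elapsed times would give a baseline
for how much of the 86 s vs. 459 s gap is decompression alone.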


On Mon, Aug 1, 2016 at 3:27 PM, Khurram Faraaz  wrote:

> What is the data format within those .gz and .bz2 files? Is it Parquet,
> JSON, or plain text (CSV)?
> Also, what was the config parameter `store.parquet.compression` set to
> when you ran your test?
>
> - Khurram


Re: deploy dockerized drill cluster

2016-08-01 Thread John Omernik
So I think part of the problem here is that the JSON options are stored in
ZooKeeper. Are you running the ZooKeeper nodes in the Drill docker containers,
in separate docker containers, or completely separately? For those settings,
I'd bootstrap the cluster by itself, get it up and running, and then use
the ALTER SYSTEM SET command, either from SQLLine or using the REST API.
The settings will then persist in ZooKeeper.
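
The REST route could be scripted roughly as follows. This is a minimal Python
sketch under stated assumptions, not a tested deployment script: it posts to
Drill's REST query endpoint (`/query.json` on the web port, 8047 by default),
the host name `drillbit1` is hypothetical, and `store.json.all_text_mode` is
used only as an example of a `store.json.*` option. The helper names are
illustrative.

```python
import json
import urllib.request


def build_alter_system_request(base_url, option, value):
    """Build (but do not send) a POST that runs ALTER SYSTEM SET via REST."""
    # Render the value as a SQL literal: booleans and numbers bare,
    # anything else as a single-quoted string.
    if isinstance(value, bool):
        literal = "true" if value else "false"
    elif isinstance(value, (int, float)):
        literal = str(value)
    else:
        literal = "'" + str(value).replace("'", "''") + "'"
    sql = "ALTER SYSTEM SET `{}` = {}".format(option, literal)
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/query.json",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def set_system_option(base_url, option, value, timeout=30):
    """Send the statement to a running drillbit and return the JSON reply."""
    req = build_alter_system_request(base_url, option, value)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Once the cluster is up, a bootstrap step could call
`set_system_option("http://drillbit1:8047", "store.json.all_text_mode", True)`
once; because system options are persisted in ZooKeeper, the call should not
need repeating on later restarts.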

I know I asked a similar question, because I wanted new clusters to be able
to be set up with specified admin users. (I.e., I wanted to provide my list
of admin users before the first drillbit started, and there wasn't an easy
way to do that at the time I asked.)



On Sun, Jul 31, 2016 at 8:07 PM, Jesse Yates 
wrote:

> Ran into this myself today, so gonna try and close the loop.
>
> Looking at the 1.6 codeline (assuming it's the same in 1.7), on startup the
> SystemOptionManager pulls out the sys.options line from the config and
> uses that to build the PersistentStoreConfig (which gets initialized from
> the store in #init()). However, this obviously doesn't take into account
> the config-level options, only the startup options, since those get pulled
> out to create everything else.
>
> It seems simple enough to wrap the SystemOptionManager in a
> FallbackOptionManager that falls back to the config file if an option is
> not in the persistent store (or something like that).
>
> Unless someone already filed a JIRA on this?
>
> On Mon, Jul 25, 2016 at 10:05 AM Scott Kinney 
> wrote:
>
> > it's still not picking up the store.json* config changes
> >
> > The only way I can see to set these is with running ALTER SYSTEM query
> > after drill api is up.
> >
> >
> > 
> > Scott Kinney | DevOps
> > stem   |   m  510.282.1299
> > 100 Rollins Road, Millbrae, California 94030
> >
> > This e-mail and/or any attachments contain Stem, Inc. confidential and
> > proprietary information and material for the sole use of the intended
> > recipient(s). Any review, use or distribution that has not been expressly
> > authorized by Stem, Inc. is strictly prohibited. If you are not the
> > intended recipient, please contact the sender and delete all copies.
> > Thank you.
> >
> > 
> > From: John Omernik 
> > Sent: Monday, July 25, 2016 8:21 AM
> > To: user
> > Subject: Re: deploy dockerized drill cluster
> >
> > Try (for the sake of the conversation here) using host networking, and
> > see if it changes how successful your setup is. (I know bridged is
> > preferred, but try the host side and see what happens.)
> >
> > John
> >
> > On Mon, Jul 25, 2016 at 10:06 AM, Scott Kinney 
> > wrote:
> >
> > > I'm running the docker in bridged network mode.
> > >
> > > 
> > > From: John Omernik 
> > > Sent: Sunday, July 24, 2016 8:28 AM
> > > To: user
> > > Subject: Re: deploy dockerized drill cluster
> > >
> > > Are you running Drill in host networking or bridged networking?
> > >
> > > On Sat, Jul 23, 2016 at 1:21 PM, Scott Kinney 
> > > wrote:
> > >
> > > > Hm, I must have set those another way in embedded mode. I can't see
> > > > where. Those settings persist between Drill restarts.
> > > >
> > > > 
> > > > From: Abhishek Girish 
> > > > Sent: Friday, July 22, 2016 1:57 PM
> > > > To: Drill User List
> > > > Subject: Re: deploy dockerized drill cluster
> > > >
> > > > You can set boot-level start-up options in drill-override.conf [1].
> > > > But I don't think we can do the same with the system options. Someone
> > > > else can comment if there is a workaround.
> > > >
> > > > Why it works for you with drill-embedded is something I'm trying to
> > > > understand. I attempted

Re: concurrent get connection in different node

2016-08-01 Thread Dechang Gu
You could try connecting to a different drillbit for each query, say in
round-robin fashion. See:
https://drill.apache.org/docs/using-the-jdbc-driver/#using-the-jdbc-url-format-for-a-direct-drillbit-connection
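
A client-side rotation along those lines might look like the sketch below.
It is a minimal Python illustration, not an official Drill facility: the node
names are hypothetical, `drillbit_urls` is an illustrative helper, and 31010
is assumed to be the drillbit user port (the Drill default). It only produces
the direct-connection URLs in the `jdbc:drill:drillbit=<host>:<port>` form;
opening the connection is left to whatever JDBC client is in use.

```python
import itertools


def drillbit_urls(drillbits, port=31010):
    """Yield direct-connection JDBC URLs, cycling through the drillbits."""
    for host in itertools.cycle(drillbits):
        yield "jdbc:drill:drillbit={}:{}".format(host, port)


# Hypothetical three-node cluster; each new connection takes the next URL.
urls = drillbit_urls(["node1", "node2", "node3"])
```

Taking `next(urls)` before each new connection spreads 16 concurrent queries
over 16 nodes one apiece, at the cost of the client having to know the node
list up front instead of discovering it through ZooKeeper.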


On Fri, Jul 29, 2016 at 2:27 AM, qiang li  wrote:

> We run queries concurrently and get connections through JDBC.
>
> We found that the queries are not distributed equally across the cluster:
> some nodes get more queries while others get fewer.
>
> This causes queries to run slowly on the busy nodes.
>
> Is there any way to distribute the queries evenly?
> For example, I have a cluster with 16 nodes. If I run 16 queries
> concurrently, the ideal result is that each node runs one query.
>


Re: Connecting Drill to Azure Data Lake

2016-08-01 Thread Sudheesh Katkam
What failure(s) do you see?

Thank you,
Sudheesh

> On Jul 29, 2016, at 4:07 PM, Kevin Verhoeven  
> wrote:
> 
> Hi Drill Community,
> 
> Has anyone attempted to connect Drill to the Azure Data Lake? Microsoft has 
> implemented a WebHDFS API over Azure Data Lake, so Drill should be able to 
> connect. I'm guessing this will be similar to s3. My initial attempts have 
> failed, does anyone have any ideas or experience with this connection?
> 
> Regards,
> 
> Kevin
> 



Re: Pushdown Capabilities with RDBMS

2016-08-01 Thread Marcus Rehm
Hi Sudheesh and Zelaine,

Thank you both for the reply.

Unfortunately I wasn't able to recreate the scenario with a small set of
tables and dummy records, but I created issue DRILL-4818 -
https://issues.apache.org/jira/browse/DRILL-4818 in JIRA and attached the
query, the physical plan, and the JSON profile related to the case.

Again, thank you very much for the support.

Best,
Marcus



2016-07-15 17:48 GMT-03:00 Marcus Rehm :

> Hi all,
>
> I started to test Drill and I'm very excited about the possibilities.
>
> Right now I'm trying to map our databases running on Oracle 11g. After
> trying some queries I realized that Drill takes longer to complete them
> than a general SQL client does. Looking at the execution plan I saw
> (or understood) that Drill is doing the join of the tables itself and is
> not pushing it down to the database.
>
> Is there any configuration required for this? How can I tell Drill to send
> the task of doing the join to Oracle?
>
> Thanks in Advance.
>
> Best regards,
> Marcus Rehm
>


RE: Connecting Drill to Azure Data Lake

2016-08-01 Thread Kevin Verhoeven
I see the following error (Drill 1.5.0):

org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: 
SocketTimeoutException: connect timed out

However, I am able to connect to the Azure Data Lake from the server that 
reported the error using curl, so it does not appear to be a standard 
connectivity issue. 

My Storage configuration is very basic:

{
  "type": "file",
  "enabled": true,
  "connection": "swebhdfs://azuredatalakestoragename.azuredatalakestore.net",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "tbl",
        "psv"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "txt": {
      "type": "text",
      "extensions": [
        "txt"
      ],
      "delimiter": ","
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json"
    },
    "avro": {
      "type": "avro"
    }
  }
}
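
Since curl reaches the service but Drill times out, one thing that might be
worth ruling out is a port mismatch. This is an assumption, not something
confirmed in this thread: curl over HTTPS uses port 443, whereas a Hadoop 2.x
swebhdfs client falls back to the NameNode HTTPS default port (50470) when
the URI carries no port. The minimal Python sketch below, with the
hypothetical helper `can_connect`, checks which ports accept a TCP connection
from the Drill host.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

Comparing `can_connect("azuredatalakestoragename.azuredatalakestore.net", 443)`
with the same call for port 50470 would show whether only 443 is reachable;
if so, putting an explicit `:443` in the `connection` URI may be worth a try.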

-----Original Message-----
From: Sudheesh Katkam [mailto:skat...@maprtech.com] 
Sent: Monday, August 1, 2016 11:03 AM
To: user@drill.apache.org
Subject: Re: Connecting Drill to Azure Data Lake

What failure(s) do you see?

Thank you,
Sudheesh
