Re: Drill favouring a particular Drillbit

2015-04-07 Thread Adam Gilmore
Anyone have any more thoughts on this? Anywhere I can start trying to troubleshoot? On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore wrote: > So there are 5 Parquet files, each ~125mb - not sure what I can provide re > the block locations? I believe it's under the HDFS block size so they > should

Re: Drill to query Client-side encrypted data from S3

2015-04-07 Thread Steven Phillips
Does EMRFS extend Hadoop FileSystem? If so, it seems like you would be able to do this by configuring the FileSystem plugin to use emrfs. On Tue, Apr 7, 2015 at 3:29 PM, Ted Dunning wrote: > Looking at the link that you provided, it appears that you are encrypting > entire data files. That pro

Re: Counting large numbers of unique values

2015-04-07 Thread Jinfeng Ni
AFAIK, Drill's planner currently does not expose the sortedness of the underlying data; if the data is pre-sorted, the planner will not recognize that and will still require a sort operator for sort-based aggregation. Part of the reason is that Drill does not have a centralized meta-store, to keep

Re: Drill to query Client-side encrypted data from S3

2015-04-07 Thread Ted Dunning
Looking at the link that you provided, it appears that you are encrypting entire data files. That probably makes it better to implement this as a layer in the file access path. Drill doesn't do this just now, but it would be relatively easy to add, I think. On Tue, Apr 7, 2015 at 3:26 PM, Ted

Re: Drill to query Client-side encrypted data from S3

2015-04-07 Thread Ted Dunning
Ahh... There is no magic that will handle decryption that you can plug into (at this time). On Tue, Apr 7, 2015 at 3:02 PM, Ganesha Muthuraman wrote: > The situation is this: > There is client side encrypted data on S3. There is an EMR cluster that > uses this as EMRFS. The EMR client reaches

Re: Counting large numbers of unique values

2015-04-07 Thread Ted Dunning
Marcin, They did comment. The answer is that the default is to use hashed aggregation (which will be faster when there is lots of memory) with the option to use sort aggregation (which is basically what you were suggesting). Did you mean to suggest that your data is already known to be sorted an

Re: Counting large numbers of unique values

2015-04-07 Thread Marcin Karpinski
@Jacques, thanks for the information - I'm definitely going to check out that option. I'm also curious that none of you guys commented on my original idea of counting distinct values by a simple aggregation of pre-sorted data - is it because it doesn't make sense to you guys, or because you think

Re: Unable to query data from hdfs

2015-04-07 Thread Ramana Inukonda
Hi Latha, In your case, if you can see the files when you run show files, you can just run select * from `/test.csv`; For example, this is what my workspace looks like: "userroot": { "location": "/user/root", "writable": true, "defaultInputFormat": null }, now: 0: jdbc

Re: Counting large numbers of unique values

2015-04-07 Thread Marcin Karpinski
That would be great - I'm all listening :) On Tue, Apr 7, 2015 at 7:22 PM, Ted Dunning wrote: > On Tue, Apr 7, 2015 at 9:19 AM, Marcin Karpinski > wrote: > > > @ Ted, ideally, I'd like to get exact results, but in case of real > > problems, we could perhaps settle on approximate counting. Is th

Re: Unable to query data from hdfs

2015-04-07 Thread Abhishek Girish
Hello, Can you please try the query as follows: > select * from dfs.root.`/test.csv`; Or > use dfs.root; > select * from `test.csv`; Regards, Abhishek On Tue, Apr 7, 2015 at 3:02 PM, Sivasubramaniam, Latha < latha.sivasubraman...@aspect.com> wrote: > Hi, > > I have the hdfs storage system reg

Re: Unable to query data from hdfs

2015-04-07 Thread Andries Engelbrecht
I assume the plugin is registered as dfs. Try: select * from dfs.root.`/test.csv`; You need to use the plugin name and workspace name in the query. Or you can simply switch to the specific schema first: use dfs.root; select * from `/test.csv`; —Andries On Apr 7, 2015, at 3:02 PM, Sivasubramaniam, Lat

Unable to query data from hdfs

2015-04-07 Thread Sivasubramaniam, Latha
Hi, I have the HDFS storage system registered. Below are the storage plugin details; it registered successfully. { "type": "file", "enabled": true, "connection": "hdfs://:8020/", "workspaces": { "root": { "location": "/user/root/", "writable": true, "defaultInput

RE: Drill to query Client-side encrypted data from S3

2015-04-07 Thread Ganesha Muthuraman
The situation is this: There is client side encrypted data on S3. There is an EMR cluster that uses this as EMRFS. The EMR client reaches out to a custom java class for decrypting it. EMR does it using the envelope encryption method, documented on AWS. http://docs.aws.amazon.com/ElasticMapReduce

Re: Drill to query Client-side encrypted data from S3

2015-04-07 Thread David Tucker
Ganesh, When you say the keys are “custom controlled”, does that mean that only special logic within your Java application allows the data to be properly accessed? There are several mechanisms within the S3 API such that encryption/decryption occur transparently to the application. If your

Re: Drill to query Client-side encrypted data from S3

2015-04-07 Thread Ted Dunning
Yes. You can integrate the decryption code into a UDF that operates on the elements. On Tue, Apr 7, 2015 at 2:41 PM, Ganesha Muthuraman wrote: > I am trying to use Drill to read from Amazon S3 where the data is > Client-side encrypted, meaning the keys to decrypt the data are custom > controlle

Drill to query Client-side encrypted data from S3

2015-04-07 Thread Ganesha Muthuraman
I am trying to use Drill to read from Amazon S3 where the data is Client-side encrypted, meaning the keys to decrypt the data are custom controlled. Is there a way I can use drill with this data given that I have a java module that can be called that will provide the master key to decrypt the da

Re: Counting large numbers of unique values

2015-04-07 Thread Ted Dunning
On Tue, Apr 7, 2015 at 9:19 AM, Marcin Karpinski wrote: > @ Ted, ideally, I'd like to get exact results, but in case of real > problems, we could perhaps settle on approximate counting. Is there already > such a functionality in Drill? > No. But it is very easy to incorporate existing libraries
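
Editor's note: the approximate counting discussed in this message can be illustrated with a small sketch. The code below is a k-minimum-values estimator written for illustration only; it is not part of Drill and is not necessarily the kind of library Ted has in mind. The function name `approx_count_distinct` and the parameter `k` are hypothetical. It assumes a relative error of roughly 1/sqrt(k) is acceptable.

```python
import hashlib
import heapq


def approx_count_distinct(values, k=1024):
    """Estimate the number of distinct values with a k-minimum-values sketch.

    Illustrative only: hypothetical helper, not a Drill function.
    """
    heap = []        # max-heap (stored negated) of the k smallest hashes so far
    members = set()  # the same hashes, for duplicate detection
    for v in values:
        digest = hashlib.sha1(str(v).encode("utf-8")).digest()
        # Map the first 8 bytes of the digest to a float in [0, 1).
        h = int.from_bytes(digest[:8], "big") / 2**64
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the largest kept one: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        # Fewer than k distinct hashes were seen, so the count is exact.
        return len(heap)
    # The k-th smallest hash sits at about k / n, so n is about (k - 1) / h_k.
    return int((k - 1) / -heap[0])
```

Memory use is bounded by `k` regardless of how many distinct values the input holds, which is the trade-off against an exact count.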

Re: Counting large numbers of unique values

2015-04-07 Thread Marcin Karpinski
@ Ted, ideally, I'd like to get exact results, but in case of real problems, we could perhaps settle on approximate counting. Is there already such a functionality in Drill? Cheers, Marcin On Tue, Apr 7, 2015 at 5:20 PM, Ted Dunning wrote: > How precise do your counts need to be? Can you accep

Re: Counting large numbers of unique values

2015-04-07 Thread Marcin Karpinski
The thing is that with 60 million unique values (which are hashes anyway) the grouping may not scale well. I've been doing tests on a single machine so far (32 cores, 64 GB RAM) and the COUNT DISTINCT query wouldn't complete (while other queries gave very encouraging results). So my idea is to facilitat

Re: Counting large numbers of unique values

2015-04-07 Thread Jacques Nadeau
Two additional notes here: Drill can actually do an aggregation using either a hash table based aggregation or a sort based aggregation. By default, generally the hash aggregation will be selected first. However, you can disable hash based aggregation if you specifically think that a sort based
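
Editor's note: the two aggregation strategies named in this message can be contrasted with a minimal sketch. This is plain Python for illustration, not Drill's implementation; both function names are hypothetical.

```python
def count_distinct_hash(values):
    """Hash-based: keep every distinct value in a hash set."""
    seen = set()
    for v in values:
        seen.add(v)
    return len(seen)


def count_distinct_sort(values):
    """Sort-based: sort first, then count the boundaries between runs."""
    ordered = sorted(values)
    count = 0
    for i, v in enumerate(ordered):
        # A new distinct value starts wherever the neighbour differs.
        if i == 0 or v != ordered[i - 1]:
            count += 1
    return count
```

The hash variant needs memory proportional to the number of distinct values; the sort variant pays for the sort but then only compares adjacent items.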

Re: Counting large numbers of unique values

2015-04-07 Thread Ted Dunning
How precise do your counts need to be? Can you accept a fraction of a percent statistical error? On Tue, Apr 7, 2015 at 8:11 AM, Aman Sinha wrote: > Drill already does most of this type of transformation. If you do an > 'EXPLAIN PLAN FOR ' > you will see that it first does a grouping on the

Re: Counting large numbers of unique values

2015-04-07 Thread Aman Sinha
Drill already does most of this type of transformation. If you do an 'EXPLAIN PLAN FOR ' you will see that it first does a grouping on the column and then applies the COUNT(column). The first level grouping can be done either based on sorting or hashing and this is configurable through a system o

Counting large numbers of unique values

2015-04-07 Thread Marcin Karpinski
Hi Guys, I have a specific use case for Drill, in which I'd like to be able to count unique values in columns with tens of millions of distinct values. Unfortunately, the COUNT DISTINCT method scales poorly in both time and memory, and the idea is to sort the data beforehand by the values of that
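
Editor's note: the idea raised in this message, counting distinct values over data that is already sorted, can be sketched as a single streaming pass. This is an illustration in plain Python under the assumption that the input really is sorted; it says nothing about how Drill plans such a query. The function name is hypothetical.

```python
from itertools import groupby


def count_distinct_presorted(sorted_values):
    """Count distinct values in an already-sorted iterable in one pass.

    Assumes the input is sorted; equal values must be adjacent.
    """
    # groupby yields one (key, group) pair per run of equal adjacent values.
    return sum(1 for _key, _group in groupby(sorted_values))
```

Because only the current run is tracked, memory use stays constant even when the input is a generator over tens of millions of rows.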