Anyone have any more thoughts on this? Anywhere I can start trying to
troubleshoot?
On Thu, Mar 26, 2015 at 4:13 PM, Adam Gilmore wrote:
> So there are 5 Parquet files, each ~125MB - not sure what I can provide re
> the block locations? I believe it's under the HDFS block size so they
> should
Does EMRFS extend Hadoop FileSystem?
If so, it seems like you would be able to do this by configuring the
FileSystem plugin to use emrfs.
On Tue, Apr 7, 2015 at 3:29 PM, Ted Dunning wrote:
> Looking at the link that you provided, it appears that you are encrypting
> entire data files. That pro
AFAIK, Drill's planner currently does not expose the sortedness of the
underlying data; if the data is pre-sorted, the Drill planner would not
recognize that, and would still require a sort operator for sort-based
aggregation. Part of the reason is that Drill does not have a centralized
meta-store to keep
Looking at the link that you provided, it appears that you are encrypting
entire data files. That probably makes it better to implement this as a
layer in the file access path.
Drill doesn't do this just now, but it would be relatively easy to add, I
think.
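A decryption layer in the file access path, as suggested above, could look roughly like the sketch below. This is not Drill code (Drill itself is Java); it is a minimal Python illustration of the idea, and the XOR transform is a toy stand-in for a real cipher, with `KEY` a hypothetical data key that would really come from a key provider.

```python
import io

KEY = b"demo-key"  # stand-in for a real data key fetched from a key provider

def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Toy reversible transform standing in for a real block cipher."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

class DecryptingReader(io.RawIOBase):
    """Wraps an encrypted byte stream and decrypts transparently on read,
    so the layers above (parsers, scanners) only ever see plaintext."""
    def __init__(self, raw, key: bytes):
        self.raw, self.key, self.pos = raw, key, 0

    def read(self, size=-1):
        data = self.raw.read(size)
        # Apply the keystream at the absolute stream position, so that
        # partial reads still line up with the original encryption.
        out = bytes(b ^ self.key[(self.pos + i) % len(self.key)]
                    for i, b in enumerate(data))
        self.pos += len(data)
        return out

# Usage: "encrypt" a small CSV, then read it back through the layer.
plaintext = b"a,b\n1,2\n"
reader = DecryptingReader(io.BytesIO(xor_bytes(plaintext, KEY)), KEY)
assert reader.read() == plaintext
```

The point of the design is that decryption happens below the record readers, so no format-specific code (CSV, Parquet, ...) has to know the data was ever encrypted.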
On Tue, Apr 7, 2015 at 3:26 PM, Ted
Ahh...
There is no magic that will handle decryption that you can plug into (at
this time).
On Tue, Apr 7, 2015 at 3:02 PM, Ganesha Muthuraman
wrote:
> The situation is this:
> There is client side encrypted data on S3. There is an EMR cluster that
> uses this as EMRFS. The EMR client reaches
Marcin,
They did comment. The answer is that the default is to use hashed
aggregation (which will be faster when there is lots of memory) with the
option to use sort aggregation (which is basically what you were
suggesting).
Did you mean to suggest that your data is already known to be sorted an
@Jacques, thanks for the information - I'm definitely going to check out
that option.
I'm also curious that none of you guys commented on my original idea of
counting distinct values by a simple aggregation of pre-sorted data - is it
because it doesn't make sense to you guys, or because you think
Hi Latha,
In your case, if you can see the files when you run SHOW FILES, you can just
run:
select * from `/test.csv`;
For example:
This is what my workspace looks like:
"userroot": {
  "location": "/user/root",
  "writable": true,
  "defaultInputFormat": null
},
now:
0: jdbc
That would be great - I'm all ears :)
On Tue, Apr 7, 2015 at 7:22 PM, Ted Dunning wrote:
> On Tue, Apr 7, 2015 at 9:19 AM, Marcin Karpinski
> wrote:
>
> > @ Ted, ideally, I'd like to get exact results, but in case of real
> > problems, we could perhaps settle on approximate counting. Is th
Hello,
Can you please try the query as follows:
> select * from dfs.root.`/test.csv`;
Or
> use dfs.root;
> select * from `test.csv`;
Regards,
Abhishek
On Tue, Apr 7, 2015 at 3:02 PM, Sivasubramaniam, Latha <
latha.sivasubraman...@aspect.com> wrote:
> Hi,
>
> I have the hdfs storage system reg
I assume the plugin is registered as dfs.
Try
select * from dfs.root.`/test.csv`;
You need to use plugin name and workspace name in the query.
Or you can simply go to the specific schema:
use dfs.root;
select * from `/test.csv`;
—Andries
On Apr 7, 2015, at 3:02 PM, Sivasubramaniam, Lat
Hi,
I have the HDFS storage system registered; the storage plugin details are
below, and it registered successfully.
{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://:8020/",
  "workspaces": {
    "root": {
      "location": "/user/root/",
      "writable": true,
      "defaultInput
The situation is this:
There is client-side encrypted data on S3. There is an EMR cluster that uses
this as EMRFS. The EMR client reaches out to a custom Java class for decrypting
it. EMR does this using the envelope encryption method, documented on AWS.
http://docs.aws.amazon.com/ElasticMapReduce
Ganesh,
When you say the keys are “custom controlled”, does that mean that only special
logic within your Java application allows the data to be properly accessed ?
There are several mechanisms within the S3 API such that encryption/decryption
occur transparently to the application. If your
Yes.
You can integrate the decryption code into a UDF that operates on the
elements.
On Tue, Apr 7, 2015 at 2:41 PM, Ganesha Muthuraman
wrote:
> I am trying to use Drill to read from Amazon S3 where the data is
> Client-side encrypted, meaning the keys to decrypt the data are custom
> controlle
I am trying to use Drill to read from Amazon S3 where the data is Client-side
encrypted, meaning the keys to decrypt the data are custom controlled. Is there
a way I can use Drill with this data, given that I have a Java module that can
be called that will provide the master key to decrypt the da
On Tue, Apr 7, 2015 at 9:19 AM, Marcin Karpinski
wrote:
> @ Ted, ideally, I'd like to get exact results, but in case of real
> problems, we could perhaps settle on approximate counting. Is there already
> such a functionality in Drill?
>
No. But it is very easy to incorporate existing libraries
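One family of "existing libraries" for this is HyperLogLog-style cardinality sketches (e.g., stream-lib on the Java side). Below is a minimal, self-contained Python sketch of the algorithm itself, purely to illustrate the memory/accuracy trade-off; it is not Drill code, and the parameters (`p=10`, MD5 as the hash) are illustrative choices, not anything from this thread.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog: estimates distinct counts using m = 2**p
    small registers instead of storing every distinct value."""
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        h = int.from_bytes(hashlib.md5(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)             # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        est = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        if est <= 2.5 * self.m:              # small-range correction
            zeros = self.registers.count(0)
            if zeros:
                est = self.m * math.log(self.m / zeros)
        return est
```

With p=10 (1024 registers, a few KB of state), the typical relative error is around 3%, regardless of whether there are 60 thousand or 60 million distinct values - which is why this approach sidesteps the memory blow-up of an exact COUNT DISTINCT.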
@ Ted, ideally, I'd like to get exact results, but in case of real
problems, we could perhaps settle on approximate counting. Is there already
such a functionality in Drill?
Cheers,
Marcin
On Tue, Apr 7, 2015 at 5:20 PM, Ted Dunning wrote:
> How precise do your counts need to be? Can you accep
The thing is that with 60M unique values (which are hashes anyway) the
grouping may not scale well. I've been doing tests on a single machine so
far (32 cores, 64GB RAM) and the COUNT DISTINCT query wouldn't complete
(while other queries gave very encouraging results). So my idea is to
facilitat
Two additional notes here:
Drill can actually do an aggregation using either a hash table based
aggregation or a sort based aggregation. By default, generally the hash
aggregation will be selected first. However, you can disable hash based
aggregation if you specifically think that a sort based
How precise do your counts need to be? Can you accept a fraction of a
percent statistical error?
On Tue, Apr 7, 2015 at 8:11 AM, Aman Sinha wrote:
> Drill already does most of this type of transformation. If you do an
> 'EXPLAIN PLAN FOR '
> you will see that it first does a grouping on the
Drill already does most of this type of transformation. If you do an
'EXPLAIN PLAN FOR '
you will see that it first does a grouping on the column and then applies
the COUNT(column). The first level grouping can be done either based on
sorting or hashing and this is configurable through a system o
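The two grouping strategies described above can be sketched as follows. This is a toy Python illustration of the distinction, not Drill's implementation: both compute the same distinct count, but the hash-based variant trades memory (a hash table of all distinct keys) for speed, while the sort-based variant only ever compares adjacent values - and if the input is already sorted, its sort step could in principle be skipped.

```python
from itertools import groupby

def count_distinct_hash(values):
    """Hash-based: build a hash set of group keys, then count them.
    Fast with ample memory; memory grows with the number of distinct keys."""
    return len(set(values))

def count_distinct_sort(values):
    """Sort-based: sort, collapse adjacent duplicates, then count.
    Needs no hash table; a pre-sorted input would make the sort a no-op."""
    return sum(1 for _ in groupby(sorted(values)))

data = ["a", "c", "b", "a", "c", "a"]
assert count_distinct_hash(data) == count_distinct_sort(data) == 3
```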
Hi Guys,
I have a specific use case for Drill, in which I'd like to be able to count
unique values in columns with tens of millions of distinct values. The COUNT
DISTINCT method, unfortunately, does not scale well in either time or memory,
and the idea is to sort the data beforehand by the values of that