RE: Encryption

2016-05-12 Thread Jordan Birdsell
Thanks, I’ll see if I can find some available cycles.

Re: Sparse Data

2016-05-12 Thread Chris George
I've used Kudu with an EAV model for sparse data, and that worked extremely well for us with billions of rows and the correct partitioning. -Chris
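Not from the thread itself, but as a rough sketch of the EAV (entity-attribute-value) shape Chris describes: one row per (entity, attribute) pair, so previously unknown attribute names become data rather than new columns. The table name, column names, bucket count, and master address below are all hypothetical, and the package names are the later org.apache.kudu ones (the 2016-era client shipped under org.kududb).

```java
import java.util.Arrays;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;

public class EavTableSketch {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();

    // One row per (entity, attribute) pair; previously unknown attribute
    // names become row data, not new columns.
    Schema schema = new Schema(Arrays.asList(
        new ColumnSchema.ColumnSchemaBuilder("entity_id", Type.STRING).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("attribute", Type.STRING).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("value", Type.STRING).build()));

    // Hash-partition on the leading key column so writes spread across tablets.
    CreateTableOptions options = new CreateTableOptions()
        .addHashPartitions(Arrays.asList("entity_id"), 16);

    client.createTable("sparse_attributes", schema, options);
    client.shutdown();
  }
}
```

Hash-partitioning on the leading key column is one way to spread the billions of rows Chris mentions across tablets; the right bucket count depends on cluster size.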

Re: Sparse Data

2016-05-12 Thread Dan Burkert
Hi Ben, Kudu doesn't support sparse datasets with many columns very well. Kudu's data model looks much more like the relational, structured data model of a traditional SQL database than HBase's data model. Kudu doesn't yet have a map column type (or any nested column types), but we do have

Sparse Data

2016-05-12 Thread Benjamin Kim
Can Kudu handle the use case where sparse data is involved? In many of our processes, we deal with data that can have any number of columns and many previously unknown column names depending on what attributes are brought in at the time. Currently, we use HBase to handle this. Since Kudu is

Re: Partition and Split rows

2016-05-12 Thread Sand Stone
Thanks for the advice, Dan. > Instead, take advantage of the index capability of Primary Keys. Currently I did make the "5-min" field a part of the primary key as well. I am most likely overdoing it. I will play around with the schema and use cases around it. > since each tablet server should only

Re: Partition and Split rows

2016-05-12 Thread Dan Burkert
On Thu, May 12, 2016 at 11:39 AM, Sand Stone wrote: > I don't know how Kudu load balances the data across the tablet servers. Individual tablets are replicated and balanced across all available tablet servers; for more on that, see

Re: Partition and Split rows

2016-05-12 Thread Sand Stone
Thanks, Dan. In your scheme, I assume you suggest range partitioning on the timestamp. I don't know how Kudu load balances the data across the tablet servers. For example, do I need to pre-calculate, every day, a list of timestamps five minutes apart at table creation? [assume I have to create a new
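For illustration only, here is roughly what pre-computed split rows could look like with the Java client, assuming time is stored as an INT64 of epoch milliseconds and using hypothetical table, column, and master names (and the later org.apache.kudu packages). Whether splits this fine are warranted depends on the load-balancing question above; 5-minute splits over one day yield close to 300 tablets.

```java
import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import org.apache.kudu.ColumnSchema;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.PartialRow;

public class RangeSplitSketch {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();

    Schema schema = new Schema(Arrays.asList(
        new ColumnSchema.ColumnSchemaBuilder("metric", Type.STRING).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("time", Type.INT64).key(true).build(),
        new ColumnSchema.ColumnSchemaBuilder("value", Type.DOUBLE).build()));

    CreateTableOptions options = new CreateTableOptions()
        .setRangePartitionColumns(Arrays.asList("time"));

    // Pre-compute split points five minutes apart for one day, so the day's
    // ingest is spread over many tablets instead of one hot tablet.
    long dayStartMs = 1463011200000L;                 // hypothetical day boundary
    long stepMs = TimeUnit.MINUTES.toMillis(5);
    for (long ts = dayStartMs + stepMs; ts < dayStartMs + TimeUnit.DAYS.toMillis(1); ts += stepMs) {
      PartialRow split = schema.newPartialRow();
      split.addLong("time", ts);
      options.addSplitRow(split);
    }

    client.createTable("metrics", schema, options);
    client.shutdown();
  }
}
```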

Re: Partition and Split rows

2016-05-12 Thread Dan Burkert
Forgot to add the PK specification to the CREATE TABLE; it should have read as follows: CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE) PRIMARY KEY (metric, time); - Dan

Re: Partition and Split rows

2016-05-12 Thread Sand Stone
> Is the requirement to pre-aggregate by time window? No, I am thinking of creating a column, say "minute". It's basically the minute field of the timestamp column (possibly rounded to a 5-minute bucket, depending on the needs). So it's a computed column filled in on data ingestion. My goal is that this
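As a hedged illustration of such a computed column at ingest time, the snippet below rounds an epoch-millisecond timestamp down to its 5-minute bucket and writes it alongside the raw timestamp. The table and column names (metrics_bucketed, minute_bucket) are hypothetical, not something proposed in the thread.

```java
import java.util.concurrent.TimeUnit;
import org.apache.kudu.client.Insert;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;

public class ComputedBucketSketch {
  // Round an epoch-millisecond timestamp down to its 5-minute bucket.
  static long fiveMinuteBucket(long epochMs) {
    long bucketMs = TimeUnit.MINUTES.toMillis(5);
    return (epochMs / bucketMs) * bucketMs;
  }

  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    KuduTable table = client.openTable("metrics_bucketed");  // hypothetical table
    KuduSession session = client.newSession();

    long now = System.currentTimeMillis();
    Insert insert = table.newInsert();
    PartialRow row = insert.getRow();
    row.addString("metric", "cpu.user");
    row.addLong("minute_bucket", fiveMinuteBucket(now));     // computed at ingest time
    row.addLong("time", now);
    row.addDouble("value", 0.42);
    session.apply(insert);

    session.close();   // flushes any pending operations
    client.shutdown();
  }
}
```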

Re: best practices to remove/retire data

2016-05-12 Thread Dan Burkert
On Thu, May 12, 2016 at 8:32 AM, Chris George wrote: > How hard would a predicate-based delete be? > I.e., ScanDelete or something. > -Chris George That might be pretty difficult, since it implicitly assumes cross-row transactional consistency. If consistency isn't
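In the absence of a server-side predicate delete, the client-side alternative is a scan followed by per-row deletes, which is what the "program" in the original question amounts to. The sketch below assumes a hypothetical metrics table keyed on (metric, time) with time as INT64 epoch milliseconds; each delete is an independent operation, with none of the cross-row consistency Dan mentions.

```java
import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import org.apache.kudu.client.Delete;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduPredicate;
import org.apache.kudu.client.KuduScanner;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.RowResult;
import org.apache.kudu.client.RowResultIterator;
import org.apache.kudu.client.SessionConfiguration;

public class RetireOldRowsSketch {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    KuduTable table = client.openTable("metrics");   // hypothetical table
    KuduSession session = client.newSession();
    session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND);

    // Scan only the primary key columns of rows older than 3 days...
    long cutoffMs = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3);
    KuduScanner scanner = client.newScannerBuilder(table)
        .setProjectedColumnNames(Arrays.asList("metric", "time"))
        .addPredicate(KuduPredicate.newComparisonPredicate(
            table.getSchema().getColumn("time"), KuduPredicate.ComparisonOp.LESS, cutoffMs))
        .build();

    // ...and issue one delete per matching row. Each delete is applied
    // independently; there is no cross-row transactional consistency.
    while (scanner.hasMoreRows()) {
      RowResultIterator batch = scanner.nextRows();
      while (batch.hasNext()) {
        RowResult result = batch.next();
        Delete delete = table.newDelete();
        delete.getRow().addString("metric", result.getString("metric"));
        delete.getRow().addLong("time", result.getLong("time"));
        session.apply(delete);
      }
    }
    session.close();   // flushes buffered deletes
    client.shutdown();
  }
}
```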

RE: Encryption

2016-05-12 Thread Jordan Birdsell
Thanks, Todd. From a roadmap perspective, do you think this will be the recommended way of enabling encryption for Kudu, or should a design be put together for something more integrated with Kudu itself?

Encryption

2016-05-12 Thread Jordan Birdsell
Hi, A while back we had a thread going about using dm-crypt as a means to encrypt Kudu data. Out of curiosity, has anyone actually done this? Thanks, Jordan Birdsell

best practices to remove/retire data

2016-05-12 Thread Sand Stone
Hi. Presumably I need to write a program to delete the unwanted rows, say, remove all data older than 3 days, while the table is still ingesting new data. How well will this perform for large tables, both deletion- and ingestion-wise? Or, for this specific case where I retire data by day, I should