Re: Performance Question

2016-07-06 Thread Dan Burkert
On Wed, Jul 6, 2016 at 7:05 AM, Benjamin Kim  wrote:

> Over the weekend, the row count is up to <500M. I will give it another few
> days to get to 1B rows. I still get consistent times ~15s for doing row
> counts despite the amount of data growing.
>
> On another note, I got a solicitation email from SnappyData to evaluate
> their product. They claim to be the “Spark Data Store” with tight
> integration with Spark executors. It claims to be an OLTP and OLAP system
> that is an in-memory data store first, spilling to disk. After going to
> several Spark events, it would seem that this is the new “hot” area for
> vendors. They all (MemSQL, Redis, Aerospike, Datastax, etc.) claim to be
> the best "Spark Data Store”. I’m wondering if Kudu will become this too?
> With the performance I’ve seen so far, it would seem that it can be a
> contender. All that is needed is a hardened Spark connector package, I
> would think. The next evaluation I will be conducting is to see if
> SnappyData’s claims are valid by doing my own tests.
>

It's hard to compare Kudu against any other data store without a lot of
analysis and thorough benchmarking, but it is certainly a goal of Kudu to
be a great platform for ingesting and analyzing data through Spark.  Up
to this point most of the Spark work has been community-driven, but more
thorough integration testing of the Spark connector is going to be a focus
going forward.

- Dan



> Cheers,
> Ben
>
>
>
> On Jun 15, 2016, at 12:47 AM, Todd Lipcon  wrote:
>
> Hi Benjamin,
>
> What workload are you using for benchmarks? Using spark or something more
> custom? rdd or data frame or SQL, etc? Maybe you can share the schema and
> some queries
>
> Todd
>
> Todd
> On Jun 15, 2016 8:10 AM, "Benjamin Kim"  wrote:
>
>> Hi Todd,
>>
>> Now that Kudu 0.9.0 is out, I have done some tests. Already, I am
>> impressed. Compared to HBase, read and write performance are better. Write
>> performance has the greatest improvement (> 4x), while read is > 1.5x.
>> Albeit, these are only preliminary tests. Do you know of a way to really do
>> some conclusive tests? I want to see if I can match your results on my 50
>> node cluster.
>>
>> Thanks,
>> Ben
>>
>> On May 30, 2016, at 10:33 AM, Todd Lipcon  wrote:
>>
>> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim  wrote:
>>
>>> Todd,
>>>
>>> It sounds like Kudu can possibly top or match those numbers put out by
>>> Aerospike. Do you have any performance statistics published or any
>>> instructions on how to measure them myself as a good way to test? In addition,
>>> this will be a test using Spark, so should I wait for Kudu version 0.9.0
>>> where support will be built in?
>>>
>>
>> We don't have a lot of benchmarks published yet, especially on the write
>> side. I've found that thorough cross-system benchmarks are very difficult
>> to do fairly and accurately, and oftentimes users end up misguided if they
>> pay too much attention to them :) So, given a finite number of developers
>> working on Kudu, I think we've tended to spend more time on the project
>> itself and less time focusing on "competition". I'm sure there are use
>> cases where Kudu will beat out Aerospike, and probably use cases where
>> Aerospike will beat Kudu as well.
>>
>> From my perspective, it would be great if you can share some details of
>> your workload, especially if there are some areas you're finding Kudu
>> lacking. Maybe we can spot some easy code changes we could make to improve
>> performance, or suggest a tuning variable you could change.
>>
>> -Todd
>>
>>
>>> On May 27, 2016, at 9:19 PM, Todd Lipcon  wrote:
>>>
>>> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim 
>>> wrote:
>>>
 Hi Mike,

 First of all, thanks for the link. It looks like an interesting read. I
 checked that Aerospike is currently at version 3.8.2.3, and in the article,
 they are evaluating version 3.5.4. The main thing that impressed me was
 their claim that they can beat Cassandra and HBase by 8x for writing and
 25x for reading. Their big claim to fame is that Aerospike can write 1M
 records per second with only 50 nodes. I wanted to see if this is real.

>>>
>>> 1M records per second on 50 nodes is pretty doable by Kudu as well,
>>> depending on the size of your records and the insertion order. I've been
>>> playing with a ~70 node cluster recently and seen 1M+ writes/second
>>> sustained, and bursting above 4M. These are 1KB rows with 11 columns, and
>>> with pretty old HDD-only nodes. I think newer flash-based nodes could do
>>> better.
>>>
>>>

 To answer your questions, we have a DMP with user profiles with many
 attributes. We create segmentation information off of these attributes to
 classify them. Then, we can target advertising appropriately for our sales
 department. Much of the data processing is for applying 

Re: Performance Question

2016-07-06 Thread Dan Burkert
On Mon, Jul 4, 2016 at 2:46 AM, 袁康(梓悠)  wrote:

> How can I delete data in a Kudu table with Spark (not delete the table at
> all)?
>

We do not currently have a way to delete a Kudu table through the Spark
connector, but you should be able to instantiate a Kudu client and delete
the table that way.  We have discussed making one of the Spark write modes
do a truncate operation, but nothing has been implemented.
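
A rough, untested sketch of doing that from Scala with the Java client (the
master address, table name, key column, and `df` here are placeholders, not
from your setup):

import org.kududb.client.KuduClient

val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()

// Option 1: drop the whole table through the client API.
// client.deleteTable("my_table")

// Option 2: delete individual rows by key, e.g. keys collected from a small
// DataFrame of ids (collect() is only reasonable for modest key sets).
val table = client.openTable("my_table")
val session = client.newSession()
df.select("my_id").collect().foreach { r =>
  val delete = table.newDelete()
  delete.getRow.addString("my_id", r.getString(0)) // assumes a STRING key column
  session.apply(delete)
}
session.flush()
client.shutdown()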

 - Dan


> --
> 发件人:Todd Lipcon 
> 发送时间:2016年7月2日(星期六) 02:44
> 收件人:user 
> 主 题:Re: Performance Question
>
> On Thu, Jun 30, 2016 at 5:39 PM, Benjamin Kim  wrote:
> Hi Todd,
>
> I changed the key to be what you suggested, and I can’t tell the
> difference since it was already fast. But, I did get more numbers.
>
> Yea, you won't see a substantial difference until you're inserting
> billions of rows, etc, and the keys and/or bloom filters no longer fit in
> cache.
>
>
> > 104M rows in Kudu table
> - read: 8s
> - count: 16s
> - aggregate: 9s
>
> The time to read took much longer, from 0.2s to 8s; counts were the same at
> 16s; and aggregate queries took longer, from 6s to 9s.
>
> I’m still impressed.
>
> We aim to please ;-) If you have any interest in writing up these
> experiments as a blog post, it would be cool to post them for others to learn
> from.
>
> -Todd
>
> On Jun 15, 2016, at 12:47 AM, Todd Lipcon  wrote:
>
> Hi Benjamin,
>
> What workload are you using for benchmarks? Using spark or something more
> custom? rdd or data frame or SQL, etc? Maybe you can share the schema and
> some queries
>
> Todd
>
> Todd
> On Jun 15, 2016 8:10 AM, "Benjamin Kim"  wrote:
> Hi Todd,
>
> Now that Kudu 0.9.0 is out, I have done some tests. Already, I am
> impressed. Compared to HBase, read and write performance are better. Write
> performance has the greatest improvement (> 4x), while read is > 1.5x.
> Albeit, these are only preliminary tests. Do you know of a way to really do
> some conclusive tests? I want to see if I can match your results on my 50
> node cluster.
>
> Thanks,
> Ben
>
> On May 30, 2016, at 10:33 AM, Todd Lipcon  wrote:
>
> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim  wrote:
> Todd,
>
> It sounds like Kudu can possibly top or match those numbers put out by
> Aerospike. Do you have any performance statistics published or any
> instructions on how to measure them myself as a good way to test? In addition,
> this will be a test using Spark, so should I wait for Kudu version 0.9.0
> where support will be built in?
>
> We don't have a lot of benchmarks published yet, especially on the write
> side. I've found that thorough cross-system benchmarks are very difficult
> to do fairly and accurately, and oftentimes users end up misguided if they
> pay too much attention to them :) So, given a finite number of developers
> working on Kudu, I think we've tended to spend more time on the project
> itself and less time focusing on "competition". I'm sure there are use
> cases where Kudu will beat out Aerospike, and probably use cases where
> Aerospike will beat Kudu as well.
>
> From my perspective, it would be great if you can share some details of
> your workload, especially if there are some areas you're finding Kudu
> lacking. Maybe we can spot some easy code changes we could make to improve
> performance, or suggest a tuning variable you could change.
>
> -Todd
>
>
> On May 27, 2016, at 9:19 PM, Todd Lipcon  wrote:
>
> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim  wrote:
> Hi Mike,
>
> First of all, thanks for the link. It looks like an interesting read. I
> checked that Aerospike is currently at version 3.8.2.3, and in the article,
> they are evaluating version 3.5.4. The main thing that impressed me was
> their claim that they can beat Cassandra and HBase by 8x for writing and
> 25x for reading. Their big claim to fame is that Aerospike can write 1M
> records per second with only 50 nodes. I wanted to see if this is real.
>
> 1M records per second on 50 nodes is pretty doable by Kudu as well,
> depending on the size of your records and the insertion order. I've been
> playing with a ~70 node cluster recently and seen 1M+ writes/second
> sustained, and bursting above 4M. These are 1KB rows with 11 columns, and
> with pretty old HDD-only nodes. I think newer flash-based nodes could do
> better.
>
>
> To answer your questions, we have a DMP with user profiles with many
> attributes. We create segmentation information off of these attributes to
> classify them. Then, we can target advertising appropriately for our sales
> department. Much of the data processing is for applying models on all or if
> not most of every profile’s attributes to find similarities (nearest
> neighbor/clustering) over a large number of rows when batch 

Re: Spark on Kudu

2016-06-17 Thread Dan Burkert
Hi Ben,

To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL, I
do not think we support that at this point.  I haven't looked deeply into
it, but we may hit issues specifying Kudu-specific options (partitioning,
column encoding, etc.).  Probably issues that can be worked through
eventually, though.  If you are interested in contributing to Kudu, this is
an area that could obviously use improvement!  Most or all of our Spark
features have been completely community driven to date.


> I am assuming that more Spark support along with semantic changes below
> will be incorporated into Kudu 0.9.1.
>

As a rule we do not release new features in patch releases, but the good
news is that we are releasing regularly, and our next scheduled release is
for the August timeframe (see JD's roadmap email
<https://lists.apache.org/thread.html/1a3b949e715a74d7f26bd9c102247441a06d16d077324ba39a662e2a@1455234076@%3Cdev.kudu.apache.org%3E>
about what we are aiming to include).  Also, Cloudera does publish snapshot
versions of the Spark connector here
<https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/>, so
the jars are available if you don't mind using snapshots.


> Anyone know of a better way to make unique primary keys other than using
> UUID to make every row unique if there is no unique column (or combination
> thereof) to use?
>

Not that I know of.  In general it's pretty rare to have a dataset without
a natural primary key (even if it's just all of the columns), but in those
cases UUID is a good solution.
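
For what it's worth, a quick (untested) way to tack a surrogate UUID key onto
a DataFrame before writing it to Kudu; the "uuid_key" column name is just an
example:

import java.util.UUID
import org.apache.spark.sql.functions.udf

val makeUuid = udf(() => UUID.randomUUID().toString)
val dfWithKey = df.withColumn("uuid_key", makeUuid())
// "uuid_key" can then be used as (part of) the primary key at table creation.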


> This is what I am using. I know auto incrementing is coming down the line
> (don’t know when), but is there a way to simulate this in Kudu using Spark
> out of curiosity?
>

To my knowledge there is no plan to have auto increment in Kudu.
Distributed, consistent, auto-incrementing counters are a difficult problem,
and I don't think there are any known solutions that would be fast enough
for Kudu (happy to be proven wrong, though!).

- Dan


>
> Thanks,
> Ben
>
> On Jun 14, 2016, at 6:08 PM, Dan Burkert <d...@cloudera.com> wrote:
>
> I'm not sure exactly what the semantics will be, but at least one of them
> will be upsert.  These modes come from spark, and they were really designed
> for file-backed storage and not table storage.  We may want to do append =
> upsert, and overwrite = truncate + insert.  I think that may match the
> normal spark semantics more closely.
>
> - Dan
>
> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Dan,
>>
>> Thanks for the information. That would mean both “append” and “overwrite”
>> modes would be combined or not needed in the future.
>>
>> Cheers,
>> Ben
>>
>> On Jun 14, 2016, at 5:57 PM, Dan Burkert <d...@cloudera.com> wrote:
>>
>> Right now append uses an update Kudu operation, which requires the row
>> already be present in the table. Overwrite maps to insert.  Kudu very
>> recently got upsert support baked in, but it hasn't yet been integrated
>> into the Spark connector.  So pretty soon these sharp edges will get a lot
>> better, since upsert is the way to go for most spark workloads.
>>
>> - Dan
>>
>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>>> I tried to use the “append” mode, and it worked. Over 3.8 million rows
>>> in 64s. I would assume that now I can use the “overwrite” mode on existing
>>> data. Now, I have to find answers to these questions. What would happen if
>>> I “append” to the data in the Kudu table if the data already exists? What
>>> would happen if I “overwrite” existing data when the DataFrame has data in
>>> it that does not exist in the Kudu table? I need to evaluate the best way
>>> to simulate the UPSERT behavior in HBase because this is what our use case
>>> is.
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>>
>>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> Now, I’m getting this error when trying to write to the table.
>>>
>>> import scala.collection.JavaConverters._
>>> val key_seq = Seq("my_id")
>>> val key_list = List("my_id").asJava
>>> kuduContext.createTable(tableName, df.schema, key_seq, new
>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
>>>
>>> df.write
>>> .options(Map("kudu.master" -> kuduMaster,"kudu.table" -> tableName))
>>> .mode("overwrite")
>>> .kudu
>>>
>>> java.lang.RuntimeException: failed to wri

Re: Performance Question

2016-06-15 Thread Dan Burkert
Adding partition splits when range partitioning is used is done via the
CreateTableOptions.addSplitRow method.
You can find more about the different partitioning options in the schema
design guide.
We generally recommend sticking to hash partitioning if possible, since you
don't have to determine your own split rows.
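
A rough sketch of what that looks like with the Java client API from Scala
(untested; the schema, the INT64 "time" column, and the 5-minute step are
made-up placeholders):

import org.kududb.ColumnSchema.ColumnSchemaBuilder
import org.kududb.{Schema, Type}
import org.kududb.client.{CreateTableOptions, KuduClient}
import scala.collection.JavaConverters._

val schema = new Schema(List(
  new ColumnSchemaBuilder("metric", Type.STRING).key(true).build(),
  new ColumnSchemaBuilder("time", Type.INT64).key(true).build(),
  new ColumnSchemaBuilder("value", Type.DOUBLE).build()).asJava)

val options = new CreateTableOptions()
  .setRangePartitionColumns(List("time").asJava)

// 500 split rows, e.g. one per 5-minute bucket, giving ~500 range tablets.
val stepMicros = 5L * 60 * 1000 * 1000
(1 until 500).foreach { i =>
  val split = schema.newPartialRow()
  split.addLong("time", i * stepMicros)
  options.addSplitRow(split)
}

val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
client.createTable("metrics", schema, options)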

- Dan

On Wed, Jun 15, 2016 at 9:17 AM, Benjamin Kim  wrote:

> Todd,
>
> I think the locality is not within our setup. We have the compute cluster
> with Spark, YARN, etc. on its own, and we have the storage cluster with
> HBase, Kudu, etc. on another. We beefed up the hardware specs on the
> compute cluster and beefed up storage capacity on the storage cluster. We
> got this setup idea from the Databricks folks. I do have a question. I
> created the table to use range partition on columns. I see that if I use
> hash partition I can set the number of splits, but how do I do that using
> range (50 nodes * 10 = 500 splits)?
>
> Thanks,
> Ben
>
>
> On Jun 15, 2016, at 9:11 AM, Todd Lipcon  wrote:
>
> Awesome use case. One thing to keep in mind is that spark parallelism will
> be limited by the number of tablets. So, you might want to split into 10 or
> so buckets per node to get the best query throughput.
>
> Usually if you run top on some machines while running the query you can
> see if it is fully utilizing the cores.
>
> Another known issue right now is that spark locality isn't working
> properly on replicated tables so you will use a lot of network traffic. For
> a perf test you might want to try a table with replication count 1
> On Jun 15, 2016 5:26 PM, "Benjamin Kim"  wrote:
>
> Hi Todd,
>
> I did a simple test of our ad events. We stream using Spark Streaming
> directly into HBase, and the Data Analysts/Scientists do some
> insight/discovery work plus some reports generation. For the reports, we
> use SQL, and for the deeper stuff, we use Spark. In Spark, our main data
> currency store of choice is DataFrames.
>
> The schema is around 83 columns wide where most are of the string data
> type.
>
> "event_type", "timestamp", "event_valid", "event_subtype", "user_ip",
> "user_id", "mappable_id",
> "cookie_status", "profile_status", "user_status", "previous_timestamp",
> "user_agent", "referer",
> "host_domain", "uri", "request_elapsed", "browser_languages", "acamp_id",
> "creative_id",
> "location_id", “pcamp_id",
> "pdomain_id", "continent_code", "country", "region", "dma", "city", "zip",
> "isp", "line_speed",
> "gender", "year_of_birth", "behaviors_read", "behaviors_written",
> "key_value_pairs", "acamp_candidates",
> "tag_format", "optimizer_name", "optimizer_version", "optimizer_ip",
> "pixel_id", “video_id",
> "video_network_id", "video_time_watched", "video_percentage_watched",
> "video_media_type",
> "video_player_iframed", "video_player_in_view", "video_player_width",
> "video_player_height",
> "conversion_valid_sale", "conversion_sale_amount",
> "conversion_commission_amount", "conversion_step",
> "conversion_currency", "conversion_attribution", "conversion_offer_id",
> "custom_info", "frequency",
> "recency_seconds", "cost", "revenue", “optimizer_acamp_id",
> "optimizer_creative_id", "optimizer_ecpm", "impression_id",
> "diagnostic_data",
> "user_profile_mapping_source", "latitude", "longitude", "area_code",
> "gmt_offset", "in_dst",
> "proxy_type", "mobile_carrier", "pop", "hostname", "profile_expires",
> "timestamp_iso", "reference_id",
> "identity_organization", "identity_method"
>
> Most queries are like counts of how many users use what browser, how many
> are unique users, etc. The part that scares most users is when it comes to
> joining this data with other dimension/3rd party events tables because of
> sheer size of it.
>
> We do what most companies do, similar to what I saw in earlier
> presentations of Kudu. We dump data out of HBase into partitioned Parquet
> tables to make query performance manageable.
>
> I will coordinate with a data scientist today to do some tests. He is
> working on identity matching/record linking of users from 2 domains: US and
> Singapore, using probabilistic deduping algorithms. I will load the data
> from ad events from both countries, and let him run his process against
> this data in Kudu. I hope this will “wow” the team.
>
> Thanks,
> Ben
>
> On Jun 15, 2016, at 12:47 AM, Todd Lipcon  wrote:
>
> Hi Benjamin,
>
> What workload are you using for benchmarks? Using spark or something more
> custom? rdd or data frame or SQL, etc? Maybe you can share the schema and
> some queries
>
> Todd
>
> Todd
> On Jun 15, 2016 8:10 AM, "Benjamin Kim"  wrote:
>
>> Hi Todd,
>>
>> Now that Kudu 0.9.0 is out, I have done some tests. Already, I am
>> impressed. Compared to HBase, read and write 

Re: Spark on Kudu

2016-06-14 Thread Dan Burkert
I'm not sure exactly what the semantics will be, but at least one of them
will be upsert.  These modes come from spark, and they were really designed
for file-backed storage and not table storage.  We may want to do append =
upsert, and overwrite = truncate + insert.  I think that may match the
normal spark semantics more closely.

- Dan

On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim <bbuil...@gmail.com> wrote:

> Dan,
>
> Thanks for the information. That would mean both “append” and “overwrite”
> modes would be combined or not needed in the future.
>
> Cheers,
> Ben
>
> On Jun 14, 2016, at 5:57 PM, Dan Burkert <d...@cloudera.com> wrote:
>
> Right now append uses an update Kudu operation, which requires the row
> already be present in the table. Overwrite maps to insert.  Kudu very
> recently got upsert support baked in, but it hasn't yet been integrated
> into the Spark connector.  So pretty soon these sharp edges will get a lot
> better, since upsert is the way to go for most spark workloads.
>
> - Dan
>
> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> I tried to use the “append” mode, and it worked. Over 3.8 million rows in
>> 64s. I would assume that now I can use the “overwrite” mode on existing
>> data. Now, I have to find answers to these questions. What would happen if
>> I “append” to the data in the Kudu table if the data already exists? What
>> would happen if I “overwrite” existing data when the DataFrame has data in
>> it that does not exist in the Kudu table? I need to evaluate the best way
>> to simulate the UPSERT behavior in HBase because this is what our use case
>> is.
>>
>> Thanks,
>> Ben
>>
>>
>>
>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>> Hi,
>>
>> Now, I’m getting this error when trying to write to the table.
>>
>> import scala.collection.JavaConverters._
>> val key_seq = Seq("my_id")
>> val key_list = List("my_id").asJava
>> kuduContext.createTable(tableName, df.schema, key_seq, new
>> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
>>
>> df.write
>> .options(Map("kudu.master" -> kuduMaster,"kudu.table" -> tableName))
>> .mode("overwrite")
>> .kudu
>>
>> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to
>> Kudu; sample errors: Not found: key not found (error 0)Not found: key not
>> found (error 0)Not found: key not found (error 0)Not found: key not found
>> (error 0)Not found: key not found (error 0)
>>
>> Does the key field need to be first in the DataFrame?
>>
>> Thanks,
>> Ben
>>
>> On Jun 14, 2016, at 4:28 PM, Dan Burkert <d...@cloudera.com> wrote:
>>
>>
>>
>> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>>> Dan,
>>>
>>> Thanks! It got further. Now, how do I set the Primary Key to be a
>>> column(s) in the DataFrame and set the partitioning? Is it like this?
>>>
>>> kuduContext.createTable(tableName, df.schema, Seq("my_id"), new
>>> CreateTableOptions().setNumReplicas(1).addHashPartitions("my_id"))
>>>
>>> java.lang.IllegalArgumentException: Table partitioning must be specified
>>> using setRangePartitionColumns or addHashPartitions
>>>
>>
>> Yep.  The `Seq("my_id")` part of that call is specifying the set of
>> primary key columns, so in this case you have specified the single PK
>> column "my_id".  The `addHashPartitions` call adds hash partitioning to the
>> table, in this case over the column "my_id" (which is good, it must be over
>> one or more PK columns, so in this case "my_id" is the one and only valid
>> combination).  However, the call to `addHashPartitions` also takes the
>> number of buckets as the second param.  You shouldn't get the
>> IllegalArgumentException as long as you are specifying either
>> `addHashPartitions` or `setRangePartitionColumns`.
>>
>> - Dan
>>
>>
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On Jun 14, 2016, at 4:07 PM, Dan Burkert <d...@cloudera.com> wrote:
>>>
>>> Looks like we're missing an import statement in that example.  Could you
>>> try:
>>>
>>> import org.kududb.client._
>>>
>>> and try again?
>>>
>>> - Dan
>>>
>>> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <bbuil...@gmail.com>
>

Re: Spark on Kudu

2016-06-14 Thread Dan Burkert
Right now append uses an update Kudu operation, which requires the row
already be present in the table. Overwrite maps to insert.  Kudu very
recently got upsert support baked in, but it hasn't yet been integrated
into the Spark connector.  So pretty soon these sharp edges will get a lot
better, since upsert is the way to go for most spark workloads.

- Dan

On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim <bbuil...@gmail.com> wrote:

> I tried to use the “append” mode, and it worked. Over 3.8 million rows in
> 64s. I would assume that now I can use the “overwrite” mode on existing
> data. Now, I have to find answers to these questions. What would happen if
> I “append” to the data in the Kudu table if the data already exists? What
> would happen if I “overwrite” existing data when the DataFrame has data in
> it that does not exist in the Kudu table? I need to evaluate the best way
> to simulate the UPSERT behavior in HBase because this is what our use case
> is.
>
> Thanks,
> Ben
>
>
>
> On Jun 14, 2016, at 5:05 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
> Hi,
>
> Now, I’m getting this error when trying to write to the table.
>
> import scala.collection.JavaConverters._
> val key_seq = Seq("my_id")
> val key_list = List("my_id").asJava
> kuduContext.createTable(tableName, df.schema, key_seq, new
> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
>
> df.write
> .options(Map("kudu.master" -> kuduMaster,"kudu.table" -> tableName))
> .mode("overwrite")
> .kudu
>
> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to
> Kudu; sample errors: Not found: key not found (error 0)Not found: key not
> found (error 0)Not found: key not found (error 0)Not found: key not found
> (error 0)Not found: key not found (error 0)
>
> Does the key field need to be first in the DataFrame?
>
> Thanks,
> Ben
>
> On Jun 14, 2016, at 4:28 PM, Dan Burkert <d...@cloudera.com> wrote:
>
>
>
> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Dan,
>>
>> Thanks! It got further. Now, how do I set the Primary Key to be a
>> column(s) in the DataFrame and set the partitioning? Is it like this?
>>
>> kuduContext.createTable(tableName, df.schema, Seq("my_id"), new
>> CreateTableOptions().setNumReplicas(1).addHashPartitions("my_id"))
>>
>> java.lang.IllegalArgumentException: Table partitioning must be specified
>> using setRangePartitionColumns or addHashPartitions
>>
>
> Yep.  The `Seq("my_id")` part of that call is specifying the set of
> primary key columns, so in this case you have specified the single PK
> column "my_id".  The `addHashPartitions` call adds hash partitioning to the
> table, in this case over the column "my_id" (which is good, it must be over
> one or more PK columns, so in this case "my_id" is the one and only valid
> combination).  However, the call to `addHashPartitions` also takes the
> number of buckets as the second param.  You shouldn't get the
> IllegalArgumentException as long as you are specifying either
> `addHashPartitions` or `setRangePartitionColumns`.
>
> - Dan
>
>
>>
>> Thanks,
>> Ben
>>
>>
>> On Jun 14, 2016, at 4:07 PM, Dan Burkert <d...@cloudera.com> wrote:
>>
>> Looks like we're missing an import statement in that example.  Could you
>> try:
>>
>> import org.kududb.client._
>>
>> and try again?
>>
>> - Dan
>>
>> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>>> I encountered an error trying to create a table based on the
>>> documentation from a DataFrame.
>>>
>>> :49: error: not found: type CreateTableOptions
>>>   kuduContext.createTable(tableName, df.schema, Seq("key"),
>>> new CreateTableOptions().setNumReplicas(1))
>>>
>>> Is there something I’m missing?
>>>
>>> Thanks,
>>> Ben
>>>
>>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <jdcry...@apache.org>
>>> wrote:
>>>
>>> It's only in Cloudera's maven repo:
>>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>>>
>>> J-D
>>>
>>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuil...@gmail.com>
>>> wrote:
>>>
>>>> Hi J-D,
>>>>
>>>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar
>>>> for spark-shell to us

Re: Spark on Kudu

2016-06-14 Thread Dan Burkert
On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <bbuil...@gmail.com> wrote:

> Dan,
>
> Thanks! It got further. Now, how do I set the Primary Key to be a
> column(s) in the DataFrame and set the partitioning? Is it like this?
>
> kuduContext.createTable(tableName, df.schema, Seq("my_id"), new
> CreateTableOptions().setNumReplicas(1).addHashPartitions("my_id"))
>
> java.lang.IllegalArgumentException: Table partitioning must be specified
> using setRangePartitionColumns or addHashPartitions
>

Yep.  The `Seq("my_id")` part of that call is specifying the set of primary
key columns, so in this case you have specified the single PK column
"my_id".  The `addHashPartitions` call adds hash partitioning to the table,
in this case over the column "my_id" (which is good, it must be over one or
more PK columns, so in this case "my_id" is the one and only valid
combination).  However, the call to `addHashPartitions` also takes the
number of buckets as the second param.  You shouldn't get the
IllegalArgumentException as long as you are specifying either
`addHashPartitions` or `setRangePartitionColumns`.
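
In other words, something like the following (untested; the 100-bucket count
is just an example value):

import scala.collection.JavaConverters._
import org.kududb.client.CreateTableOptions

kuduContext.createTable(tableName, df.schema, Seq("my_id"),
  new CreateTableOptions()
    .setNumReplicas(1)
    .addHashPartitions(List("my_id").asJava, 100))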

- Dan


>
> Thanks,
> Ben
>
>
> On Jun 14, 2016, at 4:07 PM, Dan Burkert <d...@cloudera.com> wrote:
>
> Looks like we're missing an import statement in that example.  Could you
> try:
>
> import org.kududb.client._
>
> and try again?
>
> - Dan
>
> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> I encountered an error trying to create a table based on the
>> documentation from a DataFrame.
>>
>> :49: error: not found: type CreateTableOptions
>>   kuduContext.createTable(tableName, df.schema, Seq("key"),
>> new CreateTableOptions().setNumReplicas(1))
>>
>> Is there something I’m missing?
>>
>> Thanks,
>> Ben
>>
>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <jdcry...@apache.org>
>> wrote:
>>
>> It's only in Cloudera's maven repo:
>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>>
>> J-D
>>
>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>>> Hi J-D,
>>>
>>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for
>>> spark-shell to use. Can you show me where to find it?
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcry...@apache.org>
>>> wrote:
>>>
>>> What's in this doc is what's gonna get released:
>>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>>
>>> J-D
>>>
>>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>
>>>> Will this be documented with examples once 0.9.0 comes out?
>>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>>
>>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <jdcry...@apache.org>
>>>> wrote:
>>>>
>>>> It will be in 0.9.0.
>>>>
>>>> J-D
>>>>
>>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuil...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Chris,
>>>>>
>>>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>>>>
>>>>> Thanks,
>>>>> Ben
>>>>>
>>>>>
>>>>> On May 18, 2016, at 9:01 AM, Chris George <christopher.geo...@rms.com>
>>>>> wrote:
>>>>>
>>>>> There is some code in review that needs some more refinement.
>>>>> It will allow upsert/insert from a dataframe using the datasource api.
>>>>> It will also allow the creation and deletion of tables from a dataframe
>>>>> http://gerrit.cloudera.org:8080/#/c/2992/
>>>>>
>>>>> Example usages will look something like:
>>>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc
>>>>>
>>>>> -Chris George
>>>>>
>>>>>
>>>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>>>>>
>>>>> Can someone tell me what the state is of this Spark work?
>>>>>
>>>>> Also, does anyone have any sample code on how to update/insert data in
>>>>> Kudu using DataFrames?
>>>>>
&g

Re: Spark on Kudu

2016-06-14 Thread Dan Burkert
Looks like we're missing an import statement in that example.  Could you
try:

import org.kududb.client._

and try again?

- Dan

On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim  wrote:

> I encountered an error trying to create a table based on the documentation
> from a DataFrame.
>
> :49: error: not found: type CreateTableOptions
>   kuduContext.createTable(tableName, df.schema, Seq("key"),
> new CreateTableOptions().setNumReplicas(1))
>
> Is there something I’m missing?
>
> Thanks,
> Ben
>
> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans 
> wrote:
>
> It's only in Cloudera's maven repo:
> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>
> J-D
>
> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim  wrote:
>
>> Hi J-D,
>>
>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for
>> spark-shell to use. Can you show me where to find it?
>>
>> Thanks,
>> Ben
>>
>>
>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans 
>> wrote:
>>
>> What's in this doc is what's gonna get released:
>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>
>> J-D
>>
>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim  wrote:
>>
>>> Will this be documented with examples once 0.9.0 comes out?
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans 
>>> wrote:
>>>
>>> It will be in 0.9.0.
>>>
>>> J-D
>>>
>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim 
>>> wrote:
>>>
 Hi Chris,

 Will all this effort be rolled into 0.9.0 and be ready for use?

 Thanks,
 Ben


 On May 18, 2016, at 9:01 AM, Chris George 
 wrote:

 There is some code in review that needs some more refinement.
 It will allow upsert/insert from a dataframe using the datasource api.
 It will also allow the creation and deletion of tables from a dataframe
 http://gerrit.cloudera.org:8080/#/c/2992/

 Example usages will look something like:
 http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc

 -Chris George


 On 5/18/16, 9:45 AM, "Benjamin Kim"  wrote:

 Can someone tell me what the state is of this Spark work?

 Also, does anyone have any sample code on how to update/insert data in
 Kudu using DataFrames?

 Thanks,
 Ben


 On Apr 13, 2016, at 8:22 AM, Chris George 
 wrote:

 SparkSQL cannot support these type of statements but we may be able to
 implement similar functionality through the api.
 -Chris

 On 4/12/16, 5:19 PM, "Benjamin Kim"  wrote:

 It would be nice to adhere to the SQL:2003 standard for an “upsert” if
 it were to be implemented.

 MERGE INTO table_name USING table_reference ON (condition)
  WHEN MATCHED THEN
  UPDATE SET column1 = value1 [, column2 = value2 ...]
  WHEN NOT MATCHED THEN
  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])

 Cheers,
 Ben

 On Apr 11, 2016, at 12:21 PM, Chris George 
 wrote:

 I have a wip kuduRDD that I made a few months ago. I pushed it into
 gerrit if you want to take a look.
 http://gerrit.cloudera.org:8080/#/c/2754/
 It does pushdown predicates which the existing input formatter based
 rdd does not.

 Within the next two weeks I’m planning to implement a datasource for
 spark that will have pushdown predicates and insertion/update functionality
 (need to look more at cassandra and the hbase datasource for best way to do
 this) I agree that server side upsert would be helpful.
 Having a datasource would give us useful data frames and also make
 spark sql usable for kudu.

 My reasoning for having a Spark datasource and not using Impala is: 1.
 We have had trouble getting Impala to run fast with high concurrency when
 compared to Spark. 2. We interact with datasources which do not integrate
 with Impala. 3. We have custom SQL query planners for extended SQL
 functionality.

 -Chris George


 On 4/11/16, 12:22 PM, "Jean-Daniel Cryans"  wrote:

 You guys make a convincing point, although on the upsert side we'll
 need more support from the servers. Right now all you can do is an INSERT
 then, if you get a dup key, do an UPDATE. I guess we could at least add an
 API on the client side that would manage it, but it wouldn't be atomic.

 J-D

 On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra 
 wrote:

> It's pretty simple, actually.  I need to support versioned datasets in
> a Spark SQL environment.  

Re: Proposal: remove default partitioning for new tables

2016-05-26 Thread Dan Burkert
Hi all,

Thanks for the feedback!  We've made this change, and it will be part of
the upcoming 0.9 release.  Going forward, all create table calls must have
partitioning specified.  Existing tables will not be affected.
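
As an illustrative sketch (not from the patch itself; `client` and `schema`
are placeholders), a create table call now needs explicit partitioning, e.g.:

import scala.collection.JavaConverters._
import org.kududb.client.CreateTableOptions

// Explicit hash partitioning, with a bucket count chosen by the user:
val hashed = new CreateTableOptions()
  .addHashPartitions(List("my_key").asJava, 16)

// Or, for users who really do want a single tablet, explicit range
// partitioning with no split rows:
val singleTablet = new CreateTableOptions()
  .setRangePartitionColumns(List("my_key").asJava)

client.createTable("my_table", schema, hashed)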

- Dan

On Fri, May 20, 2016 at 6:41 AM, Jordan Birdsell <
jordan.birdsell.k...@statefarm.com> wrote:

> +1 ...this is a great recommendation
>
> -Original Message-
> From: Sand Stone [mailto:sand.m.st...@gmail.com]
> Sent: Thursday, May 19, 2016 10:39 PM
> To: user@kudu.incubator.apache.org
> Cc: d...@kudu.incubator.apache.org
> Subject: Re: Proposal: remove default partitioning for new tables
>
> Agreed that this is a sensible API change.
>
> On Thu, May 19, 2016 at 4:07 PM, Abhi Basu <9000r...@gmail.com> wrote:
>
> > I think this a very reasonable feature request. I have recently started
> > working with Kudu and the "default" behavior has already tripped me up a
> > couple times.
> >
> > Thanks,
> >
> > Abhi
> >
> > On Thu, May 19, 2016 at 4:03 PM, Dan Burkert <danburk...@apache.org>
> > wrote:
> >
> >> Hi all,
> >>
> >> One of the issues that trips up new Kudu users is the uncertainty about
> >> how partitioning works, and how to use partitioning effectively.  Much
> of
> >> this can be addressed with better documentation and explanatory
> materials,
> >> and that should be an area of focus leading up to our 1.0 release.
> However,
> >> the default partitioning behavior is suboptimal, and changing the
> default
> >> could lead to significantly less user confusion and frustration.
> Currently,
> >> when creating a new table, Kudu defaults to using only a single tablet,
> >> which is a known anti-pattern.  This can be painful for users who
> create a
> >> table assuming Kudu will have good defaults, and begin loading data
> only to
> >> find out later that they will need to recreate the table with
> partitioning
> >> to achieve good results.
> >>
> >> A better default partitioning strategy might be hash partitioning over
> >> the primary key columns, with a number of hash buckets based on the
> number
> >> of tablet servers (perhaps something like 3x the number of tablet
> >> servers).  This would alleviate the worst scalability issues with the
> >> current default, however it has a few downsides of its own. Hash
> >> partitioning is not appropriate for every use case, and any
> rule-of-thumb
> >> number of tablets we could come up with will not always be optimal.
> >>
> >> Given that there is no bullet-proof default, and that changing
> >> partitioning strategy after table creation is impossible, and changing
> the
> >> default partitioning strategy is a backwards incompatible change, I
> propose
> >> we remove the default altogether.  Users would be required to explicitly
> >> specify the table partitioning during creation, and failing to do so
> would
> >> result in an illegal argument error.  Users who really do want only a
> >> single tablet will still be able to do so by explicitly configuring
> range
> >> partitioning with no split rows.
> >>
> >> I'd like to get community feedback on whether this seems like a good
> >> direction to take.  I have put together a patch, you can check out the
> >> changes to test files to see what it looks like to add partitioning
> >> explicitly in cases where the default was being relied on.
> >> http://gerrit.cloudera.org:8080/#/c/3131/
> >>
> >> - Dan
> >>
> >
> >
> >
> > --
> > Abhi Basu
> >
>


Re: Partition and Split rows

2016-05-16 Thread Dan Burkert
Hi Amit, responses inline


> Q: Can I fetch Kudu TIMESTAMP data from Tableau/Pentaho for reporting
> and analytics purposes, or do I need the INT64 datatype only?
>

Is Tableau/Pentaho using Impala to query?  If so, see the answers below.


> Q: Are you going to provide the Kimpala merge or Kudu TIMESTAMP support in
> Impala/Tableau in the near future?
>

TIMESTAMP typed columns should work in Impala/Kudu as of the 0.8 release
(the latest), however I know there were a few bugs in the previous
releases.  If you're using the latest and are still seeing errors, please
send the error and/or file an issue on JIRA.


> Q: At this moment, instead of TIMESTAMP we are using the INT64 type in Kudu.
> Will it be equally helpful for performance if we use the partition-and-split
> approach explained for the TIMESTAMP datatype? E.g., can we get better
> performance for a composite key on a given Kudu table defined as
> 'metric(s), INT64 (has timestamp data)' with a partition on the INT64 column
> with a split-by clause?
>

Internally, the TIMESTAMP type is just an alias to INT64, so it has the
exact same performance characteristics.  The only difference is how the
values are displayed in log messages and on the web UI.

- Dan


> On Sat, May 7, 2016 at 9:20 PM, Dan Burkert <d...@cloudera.com> wrote:
>
>> Hi Sand,
>>
>> I've been working on some diagrams to help explain some of the more
>> advanced partitioning types, it's attached.   Still pretty rough at this
>> point, but the goal is to clean it up and move it into the Kudu
>> documentation proper.  I'm interested to hear what kind of time series you
>> are interested in Kudu for.  I'm tasked with improving Kudu for time
>> series, you can follow progress here
>> <https://issues.apache.org/jira/browse/KUDU-1306>. If you have any
>> additional ideas I'd love to hear them.  You may also be interested in a
>> small project that JD and I have been working on in the past week to
>> build an OpenTSDB style store on top of Kudu, you can find it here
>> <https://github.com/danburkert/kudu-ts>.  Still quite feature limited at
>> this point.
>>
>> - Dan
>>
>> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sand.m.st...@gmail.com>
>> wrote:
>>
>>> Thanks. Will read.
>>>
>>> Given that I am researching time series data, row locality is crucial
>>> :-)
>>>
>>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <jdcry...@apache.org>
>>> wrote:
>>>
>>>> We do have non-covering range partitions coming in the next few months,
>>>> here's the design (in review):
>>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>>
>>>> The "Background & Motivation" section should give you a good idea of
>>>> why I'm mentioning this.
>>>>
>>>> Meanwhile, if you don't need row locality, using hash partitioning
>>>> could be good enough.
>>>>
>>>> J-D
>>>>
>>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sand.m.st...@gmail.com>
>>>> wrote:
>>>>
>>>>> Makes sense.
>>>>>
>>>>> Yeah it would be cool if users could specify/control the split rows
>>>>> after the table is created. Now, I have to "think ahead" to pre-create the
>>>>> range buckets.
>>>>>
>>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <
>>>>> jdcry...@apache.org> wrote:
>>>>>
>>>>>> You will only get 1 tablet and no data distribution, which is bad.
>>>>>>
>>>>>> That's also how HBase works, but it will split regions as you insert
>>>>>> data and eventually you'll get some data distribution even if it doesn't
>>>>>> start in an ideal situation. Tablet splitting will come later for Kudu.
>>>>>>
>>>>>> J-D
>>>>>>
>>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sand.m.st...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> One more questions, how does the range partition work if I don't
>>>>>>> specify the split rows?
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sand.m.st...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks, Misty. The "advanced" impala example helped.

Re: Sparse Data

2016-05-12 Thread Dan Burkert
Hi Ben,

Kudu doesn't support sparse datasets with many columns very well.  Kudu's
data model looks much more like the relational, structured data model of a
traditional SQL database than HBase's data model.  Kudu doesn't yet have a
map column type (or any nested column types), but we do have BINARY typed
columns if you can handle your own serialization. Oftentimes, however, it's
better to restructure the data so that it can fit Kudu's structure better.
If you can give more information about your usage patterns (especially
details of the queries you wish to optimize for) I can perhaps give better info.
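
As a rough illustration of the BINARY route (untested; `table`, `session`, and
the column names are placeholders), you would serialize the sparse attribute
map yourself and store it in a single column:

import java.io.{ByteArrayOutputStream, ObjectOutputStream}

def serializeAttrs(attrs: Map[String, String]): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(attrs)
  oos.close()
  bos.toByteArray
}

// table/session come from a KuduClient, as in the other examples in this thread.
val insert = table.newInsert()
insert.getRow.addString("user_id", "u123")
insert.getRow.addBinary("attributes", serializeAttrs(Map("gender" -> "f", "dma" -> "803")))
session.apply(insert)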

- Dan

On Thu, May 12, 2016 at 2:08 PM, Benjamin Kim  wrote:

> Can Kudu handle the use case where sparse data is involved? In many of our
> processes, we deal with data that can have any number of columns and many
> previously unknown column names depending on what attributes are brought in
> at the time. Currently, we use HBase to handle this. Since Kudu is based on
> HBase, can it do the same? Or, do we have to use a map data type column for
> this?
>
> Thanks,
> Ben
>
>


Re: Partition and Split rows

2016-05-12 Thread Dan Burkert
On Thu, May 12, 2016 at 11:39 AM, Sand Stone <sand.m.st...@gmail.com> wrote:

I don't know how Kudu load balances the data across the tablet servers.
>

Individual tablets are replicated and balanced across all available tablet
servers; for more on that, see
http://getkudu.io/docs/schema_design.html#data-distribution.



> For example, do I need to pre-calculate every day a list of timestamps 5
> minutes apart at table creation? [assume I have to create a new table
> every day].
>

If you wish to range partition on the time column, then yes, currently you
must specify the splits upfront during table creation (but this will change
with the non-covering range partitions work).


>
> My hope, with the additional 5-min column used as the range
> partition column, is that I could spread the data evenly across the
> tablet servers.
>

I don't think this is meaningfully different than range partitioning on the
full time column with splits every 5 minutes.


> Also, since 5-min interval data are always colocated together, the read
> query could be efficient too.
>

Data colocation is a function of the partitioning and indexing.  As I
mentioned before, if you have timestamp as part of your primary key then
you can guarantee that scans specifying a time range are efficient. Overall
it sounds like you are attempting to get fast scans by creating many fine
grained partitions, as you might with Parquet.  This won't be an efficient
strategy in Kudu, since each tablet server should only have on the order of
10-20 tablets.  Instead, take advantage of the index capability of Primary
Keys.

- Dan


> On Thu, May 12, 2016 at 11:13 AM, Dan Burkert <d...@cloudera.com> wrote:
>
>> Forgot to add the PK specification to the CREATE TABLE, it should have
>> read as follows:
>>
>> CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE)
>> PRIMARY KEY (metric, time);
>>
>> - Dan
>>
>>
>> On Thu, May 12, 2016 at 11:12 AM, Dan Burkert <d...@cloudera.com> wrote:
>>
>>>
>>> On Thu, May 12, 2016 at 11:05 AM, Sand Stone <sand.m.st...@gmail.com>
>>> wrote:
>>>
>>>> > Is the requirement to pre-aggregate by time window?
>>>> No, I am thinking to create a column say, "minute". It's basically the
>>>> minute field of the timestamp column (even rounded to a 5-min bucket depending
>>>> on the needs). So it's a computed column being filled in on data ingestion.
>>>> My goal is that this field would help with data filtering at read/query
>>>> time, say select certain projection at minute 10-15, to speed up the read
>>>> queries.
>>>>
>>>
>>> In many cases, Kudu can do this for you without having to add special
>>> columns.  The requirements are that the timestamp is part of the primary
>>> key, and any columns that come before the timestamp in the primary key (if
>>> it's a compound PK), have equality predicates.  So for instance, if you
>>> create a table such as:
>>>
>>> CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE);
>>>
>>> then queries such as
>>>
>>> SELECT time, value FROM metrics WHERE metric = "my-metric" AND time >
>>> 2016-05-01T00:00 AND time < 2016-05-01T00:05
>>>
>>> Then only the data for that 5 minute time window will be read from
>>> disk.  If the query didn't have the equality predicate on the 'metric'
>>> column, then it would do a much bigger scan + filter operation.  If you
>>> want more background on how this is achieved, check out the partition
>>> pruning design doc:
>>> https://github.com/apache/incubator-kudu/blob/master/docs/design-docs/scan-optimization-partition-pruning.md
>>> .
>>>
>>> - Dan
>>>
>>>
>>>
>>>> Thanks for the info., I will follow them.
>>>>
>>>> On Thu, May 12, 2016 at 10:50 AM, Dan Burkert <d...@cloudera.com> wrote:
>>>>
>>>>> Hey Sand,
>>>>>
>>>>> Sorry for the delayed response.  I'm not quite following your use
>>>>> case.  Is the requirement to pre-aggregate by time window? I don't think
>>>>> Kudu can help you directly with that (nothing built in), but you could
>>>>> always create a separate table to store the pre-aggregated values.  As far
>>>>> as applying functions to do row splits, that is an interesting idea, but I
>>>>> think once Kudu has support for range bounds (the non-covering range
>>>>> partition design doc linked above), you can simply cre

Re: Partition and Split rows

2016-05-12 Thread Dan Burkert
Forgot to add the PK specification to the CREATE TABLE, it should have read
as follows:

CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE)
PRIMARY KEY (metric, time);

- Dan


On Thu, May 12, 2016 at 11:12 AM, Dan Burkert <d...@cloudera.com> wrote:

>
> On Thu, May 12, 2016 at 11:05 AM, Sand Stone <sand.m.st...@gmail.com>
> wrote:
>
>> > Is the requirement to pre-aggregate by time window?
>> No, I am thinking to create a column say, "minute". It's basically the
>> minute field of the timestamp column (even rounded to a 5-min bucket depending
>> on the needs). So it's a computed column being filled in on data ingestion.
>> My goal is that this field would help with data filtering at read/query
>> time, say select certain projection at minute 10-15, to speed up the read
>> queries.
>>
>
> In many cases, Kudu can do this for you without having to add special
> columns.  The requirements are that the timestamp is part of the primary
> key, and any columns that come before the timestamp in the primary key (if
> it's a compound PK), have equality predicates.  So for instance, if you
> create a table such as:
>
> CREATE TABLE metrics (metric STRING, time TIMESTAMP, value DOUBLE);
>
> then queries such as
>
> SELECT time, value FROM metrics WHERE metric = "my-metric" AND time >
> 2016-05-01T00:00 AND time < 2016-05-01T00:05
>
> Then only the data for that 5 minute time window will be read from disk.
> If the query didn't have the equality predicate on the 'metric' column,
> then it would do a much bigger scan + filter operation.  If you want more
> background on how this is achieved, check out the partition pruning design
> doc:
> https://github.com/apache/incubator-kudu/blob/master/docs/design-docs/scan-optimization-partition-pruning.md
> .
>
> - Dan
>
>
>
>> Thanks for the info., I will follow them.
>>
>> On Thu, May 12, 2016 at 10:50 AM, Dan Burkert <d...@cloudera.com> wrote:
>>
>>> Hey Sand,
>>>
>>> Sorry for the delayed response.  I'm not quite following your use case.
>>> Is the requirement to pre-aggregate by time window? I don't think Kudu can
>>> help you directly with that (nothing built in), but you could always create
>>> a separate table to store the pre-aggregated values.  As far as applying
>>> functions to do row splits, that is an interesting idea, but I think once
>>> Kudu has support for range bounds (the non-covering range partition design
>>> doc linked above), you can simply create the bounds where the function
>>> would have put them.  For example, if you want a partition for every five
>>> minutes, you can create the bounds accordingly.
>>>
>>> Earlier this week I gave a talk on timeseries in Kudu, I've included
>>> some slides that may be interesting to you.  Additionally, you may want to
>>> check out https://github.com/danburkert/kudu-ts, it's a very young
>>>  (not feature complete) metrics layer on top of Kudu, it may give you some
>>> ideas.
>>>
>>> - Dan
>>>
>>> On Sat, May 7, 2016 at 1:28 PM, Sand Stone <sand.m.st...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for sharing, Dan. The diagrams explained clearly how the current
>>>> system works.
>>>>
>>>> As for things in my mind. Take the schema of <host,metric,time,...>,
>>>> say, I am interested in data for the past 5 mins, 10 mins, etc. Or,
>>>> aggregate at 5 mins interval for the past 3 days, 7 days, ... Looks like I
>>>> need to introduce a special 5-min bar column, use that column to do range
>>>> partition to spread data across the tablet servers so that I could leverage
>>>> parallel filtering.
>>>>
>>>> The cost of this extra column (INT8) is not ideal but not too bad
>>>> either (storage cost wise, compression should do wonders). So I am thinking
>>>> whether it would be better to take "functions" as row split instead of only
>>>> constants. Of course if business requires to drop down to 1-min bar, the
>>>> data has to be re-sharded again. So a more cost effective way of doing this
>>>> on a production cluster would be good.
>>>>
>>>>
>>>>
>>>>
>>>> On Sat, May 7, 2016 at 8:50 AM, Dan Burkert <d...@cloudera.com> wrote:
>>>>
>>>>> Hi Sand,
>>>>>
>>>>> I've been working on some diagrams to help explain some of the more
>>>>> advanced partitioning types, it's attac

Re: best practices to remove/retire data

2016-05-12 Thread Dan Burkert
On Thu, May 12, 2016 at 8:32 AM, Chris George 
wrote:

> How hard would a predicate-based delete be?
> I.e., ScanDelete or something.
> -Chris George
>

That might be pretty difficult, since it implicitly assumes cross-row
transactional consistency.  If consistency isn't required, you can simulate
it today by starting the scan and issuing deletes for each result.
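
A rough sketch of that simulation with the Java client from Scala (untested;
the table, the INT64 "time" column, and the cutoff are placeholders):

import org.kududb.client.{KuduClient, KuduPredicate}
import org.kududb.client.KuduPredicate.ComparisonOp
import scala.collection.JavaConverters._

val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
val table = client.openTable("metrics")
val session = client.newSession()

// Delete everything older than ~3 days, assuming "time" stores microseconds.
val cutoffMicros = (System.currentTimeMillis() - 3L * 24 * 3600 * 1000) * 1000
val pred = KuduPredicate.newComparisonPredicate(
  table.getSchema.getColumn("time"), ComparisonOp.LESS, cutoffMicros)
val scanner = client.newScannerBuilder(table)
  .addPredicate(pred)
  .setProjectedColumnNames(List("metric", "time").asJava) // key columns only
  .build()

while (scanner.hasMoreRows) {
  val results = scanner.nextRows()
  while (results.hasNext) {
    val row = results.next()
    val delete = table.newDelete()
    delete.getRow.addString("metric", row.getString("metric"))
    delete.getRow.addLong("time", row.getLong("time"))
    session.apply(delete)
  }
}
session.flush()
client.shutdown()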

- Dan



>
> On 5/12/16, 9:24 AM, "Jean-Daniel Cryans"  wrote:
>
> Hi,
>
> Right now this use case is more difficult than it needs to be. In your
> previous thread, "Partition and Split rows", we talked about non-covering
> range partitions, and this is something that would help your use case a lot.
> Basically, you could create partitions that cover full days, and every day
> you could delete the old partitions while creating the next day's. Deleting
> a partition is really quick and efficient compared to manually deleting
> individual rows.
>
> Until this is available I'd do this with multiple tables, but it's a mess
> to handle as you described.
>
> Hope this helps,
>
> J-D
>
> On Thu, May 12, 2016 at 8:16 AM, Sand Stone 
> wrote:
>
>> Hi. Presumably I need to write a program to delete the unwanted rows,
>> say, remove all data older than 3 days, while the table is still ingesting
>> new data.
>>
>> How well will this perform for large tables? Both deletion and ingestion
>> wise.
>>
>> Or, for this specific case where I retire data by day, I should create a
>> new table per day. However, then the users have to be aware of the table
>> naming scheme somehow. If the retention policy is changed, all the client-side
>> code might have to change (sure, we can have one level of indirection to
>> minimize the pain).
>>
>> Thanks.
>>
>
>


Re: Please welcome Binglin Chang as a Kudu committer and PPMC member

2016-04-05 Thread Dan Burkert
Welcome, Binglin!

- Dan

On Mon, Apr 4, 2016 at 9:28 PM, Mike Percy  wrote:

> Welcome aboard, Binglin! Looking forward to your continued contributions to
> the project!
>
> Best,
> Mike
>
> On Mon, Apr 4, 2016 at 9:11 PM, Todd Lipcon  wrote:
>
> > Hi Kudu community,
> >
> > On behalf of the Apache Kudu PPMC, I am pleased to announce that Binglin
> > Chang has been elected as our newest PPMC member and committer. Binglin
> has
> > been contributing to Kudu steadily over the last year in many different
> > ways -- from speaking at conferences, to finding and fixing bugs, to
> adding
> > new features, Binglin has been a great asset to the project.
> >
> > Please join me in congratulating Binglin! Thank you for your work so far
> > and we hope to see your involvement continue and grow over the coming
> > years.
> >
> > -Todd and the rest of the PPMC
> >
>


Re: sparkContext won't stop when using spark-kudu

2016-03-19 Thread Dan Burkert
Hi Darren,

I found the culprit, and I've put up a patch here
<http://gerrit.cloudera.org:8080/#/c/2571/>.  Should make it into the next
release (0.8.0).  Until then, stopping the shell with the 'exit' command or
Ctrl-C should do the trick.

- Dan

On Tue, Mar 15, 2016 at 12:04 PM, Dan Burkert <d...@cloudera.com> wrote:

> Hi Darren,
>
> I was able to repro locally.  I think our connector is not implementing
> some shutdown hooks.  I'm going to track down a Spark expert and figure out
> exactly what we should be doing to have a more graceful shutdown.
>
> - Dan
>