Re: Performance Question

2016-07-18 Thread Benjamin Kim
Todd, I upgraded, deleted the table and recreated it because it was inaccessible, and re-introduced the downed tablet server after clearing out all Kudu directories. The Spark Streaming job is repopulating again. Thanks, Ben > On Jul 18, 2016, at 10:32 AM, Todd Lipcon

Re: Performance Question

2016-07-18 Thread Todd Lipcon
On Mon, Jul 18, 2016 at 10:31 AM, Benjamin Kim wrote: > Todd, > > Thanks for the info. I was going to upgrade after the testing, but now, it > looks like I will have to do it earlier than expected. > > I will do the upgrade, then resume. > OK, sounds good. The upgrade

Re: Performance Question

2016-07-18 Thread Benjamin Kim
Todd, Thanks for the info. I was going to upgrade after the testing, but now, it looks like I will have to do it earlier than expected. I will do the upgrade, then resume. Cheers, Ben > On Jul 18, 2016, at 10:29 AM, Todd Lipcon wrote: > > Hi Ben, > > Any chance that you

Re: Performance Question

2016-07-18 Thread Todd Lipcon
Hi Ben, Any chance that you are running Kudu 0.9.0 instead of 0.9.1? There's a known serious bug in 0.9.0 which can cause this kind of corruption. Assuming that you are running with replication count 3 this time, you should be able to move aside that tablet metadata file and start the server. It

Re: Performance Question

2016-07-18 Thread Benjamin Kim
During my re-population of the Kudu table, I am getting this error trying to restart a tablet server after it went down. The job that populates this table has been running for over a week. [libprotobuf ERROR google/protobuf/message_lite.cc:123] Can't parse message of type

Re: Performance Question

2016-07-11 Thread Benjamin Kim
Todd, It’s no problem to start over again. But, a tool like that would be helpful. Gaps in data can be accommodated by just backfilling. Thanks, Ben > On Jul 11, 2016, at 10:47 AM, Todd Lipcon wrote: > > On Mon, Jul 11, 2016 at 10:40 AM, Benjamin Kim

Re: Performance Question

2016-07-11 Thread Todd Lipcon
On Mon, Jul 11, 2016 at 10:40 AM, Benjamin Kim wrote: > Todd, > > I had it at one replica. Do I have to recreate? > We don't currently have the ability to "accept data loss" on a tablet (or set of tablets). If the machine is gone for good, then currently the only easy way to

Re: Performance Question

2016-07-11 Thread Benjamin Kim
Todd, I had it at one replica. Do I have to recreate? Thanks, Ben > On Jul 11, 2016, at 10:37 AM, Todd Lipcon wrote: > > Hey Ben, > > Is the table that you're querying replicated? Or was it created with only one > replica per tablet? > > -Todd > > On Mon, Jul 11, 2016

Re: Performance Question

2016-07-11 Thread Todd Lipcon
Hey Ben, Is the table that you're querying replicated? Or was it created with only one replica per tablet? -Todd On Mon, Jul 11, 2016 at 10:35 AM, Benjamin Kim wrote: > Over the weekend, a tablet server went down. It’s not coming back up. So, > I decommissioned it and removed

Re: Performance Question

2016-07-08 Thread Benjamin Kim
Dan, This is good to hear as we are heavily invested in Spark as are many of our competitors in the AdTech/Telecom world. It would be nice to have Kudu be on par with the other data store technologies in terms of Spark usability, so as to not choose one based on “who provides it now in

Re: Performance Question

2016-07-06 Thread Dan Burkert
On Wed, Jul 6, 2016 at 7:05 AM, Benjamin Kim wrote: > Over the weekend, the row count is up to <500M. I will give it another few > days to get to 1B rows. I still get consistent times ~15s for doing row > counts despite the amount of data growing. > > On another note, I got a

Re: Performance Question

2016-07-06 Thread Dan Burkert
July 2, 2016 (Saturday) 02:44 > To: user <user@kudu.incubator.apache.org> > Subject: Re: Performance Question > > On Thu, Jun 30, 2016 at 5:39 PM, Benjamin Kim <bbuil...@gmail.com> wrote: > Hi Todd, > > I changed the key to be what you suggested, and I can’t tell the > differen

Re: Performance Question

2016-07-06 Thread Benjamin Kim
Over the weekend, the row count is up to <500M. I will give it another few days to get to 1B rows. I still get consistent times ~15s for doing row counts despite the amount of data growing. On another note, I got a solicitation email from SnappyData to evaluate their product. They claim to be

Re: Performance Question

2016-07-01 Thread Todd Lipcon
On Thu, Jun 30, 2016 at 5:39 PM, Benjamin Kim wrote: > Hi Todd, > > I changed the key to be what you suggested, and I can’t tell the > difference since it was already fast. But, I did get more numbers. > Yea, you won't see a substantial difference until you're inserting

Re: Performance Question

2016-06-30 Thread Benjamin Kim
Hi Todd, I changed the key to be what you suggested, and I can’t tell the difference since it was already fast. But, I did get more numbers. > 104M rows in Kudu table: read 8s, count 16s, aggregate 9s. The time to read took much longer, from 0.2s to 8s; counts were the same at 16s, and
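For readers wondering what such a test looks like, a minimal sketch of the three operations follows (a sketch only, assuming the kudu-spark connector is on the classpath and a spark-shell session where sqlContext already exists; the master address, table name, and column names are made up, since the thread does not show the actual schema):

    import org.apache.spark.sql.functions._

    val events = sqlContext.read
      .format("org.apache.kudu.spark.kudu")          // "org.kududb.spark.kudu" on pre-1.0 releases
      .options(Map("kudu.master" -> "kudu-master:7051",
                   "kudu.table"  -> "ad_events"))
      .load()

    events.cache()                                   // "read": materialize the full scan
    val total = events.count()                       // "count": full row count
    events.groupBy("event_type")                     // "aggregate": a simple group-by
      .agg(count("*").as("n"))
      .show()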

Re: Performance Question

2016-06-29 Thread Todd Lipcon
On Wed, Jun 29, 2016 at 2:18 PM, Benjamin Kim wrote: > Todd, > > FYI. The key is unique for every row so rows are not going to already > exist. Basically, everything is an INSERT. > > val generateUUID = udf(() => UUID.randomUUID().toString) > > As you can see, we are using

Re: Performance Question

2016-06-29 Thread Benjamin Kim
Todd, FYI. The key is unique for every row so rows are not going to already exist. Basically, everything is an INSERT. val generateUUID = udf(() => UUID.randomUUID().toString) As you can see, we are using the Java UUID library to create the key. Cheers, Ben > On Jun 29, 2016, at 1:32 PM, Todd
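A minimal sketch of how a UDF like the one above can be used to add the key column before the insert (the sample data and column names are hypothetical, not taken from the thread):

    import java.util.UUID
    import org.apache.spark.sql.functions.udf

    val generateUUID = udf(() => UUID.randomUUID().toString)

    val events = sqlContext.createDataFrame(Seq(("click", 1468800000L), ("view", 1468800001L)))
      .toDF("event_type", "event_ts")

    // Every row gets a fresh random key, so it never collides with an existing row
    // and each write is effectively a pure INSERT.
    val keyed = events.withColumn("id", generateUUID())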

Re: Performance Question

2016-06-29 Thread Todd Lipcon
On Wed, Jun 29, 2016 at 11:32 AM, Benjamin Kim wrote: > Todd, > > I started Spark streaming more events into Kudu. Performance is great > there too! With HBase, it’s fast too, but I noticed that it pauses here and > there, making it take seconds for > 40k rows at a time,

Re: Performance Question

2016-06-29 Thread Benjamin Kim
Todd, I started Spark streaming more events into Kudu. Performance is great there too! With HBase, it’s fast too, but I noticed that it pauses here and there, making it take seconds for > 40k rows at a time, while Kudu doesn’t. The progress bar just blinks by. I will keep this running until it
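The actual streaming job isn't shown in the thread, but a rough sketch of streaming micro-batches into Kudu with the kudu-spark KuduContext might look like the following (the socket source, schema, master address, and table name are all illustrative; depending on the kudu-spark version, the KuduContext constructor may also require the SparkContext, and on 0.9.x the package is org.kududb.spark.kudu):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.kudu.spark.kudu.KuduContext

    val ssc = new StreamingContext(sc, Seconds(10))
    val kuduContext = new KuduContext("kudu-master:7051")

    ssc.socketTextStream("event-host", 9999)
      .map(_.split(","))
      .map(f => (f(0), f(1)))
      .foreachRDD { rdd =>
        // Convert each micro-batch to a DataFrame and insert it into the Kudu table.
        val batch = sqlContext.createDataFrame(rdd).toDF("id", "event_type")
        kuduContext.insertRows(batch, "ad_events")
      }

    ssc.start()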

Re: Performance Question

2016-06-28 Thread Todd Lipcon
Cool, thanks for the report, Ben. For what it's worth, I think there's still some low-hanging fruit in the Spark connector for Kudu (for example, I believe locality on reads is currently broken). So, you can expect performance to continue to improve in future versions. I'd also be interested to

Re: Performance Question

2016-06-28 Thread Benjamin Kim
FYI. I did a quick-n-dirty performance test. First, the setup: QA cluster: 15 data nodes, 64GB memory each; HBase is using 4GB of memory; Kudu is using 1GB of memory; 1 HBase/Kudu master node, 64GB memory; HBase/Kudu master is using 1GB of memory each; 10Gb Ethernet. Using Spark on both to load/read

Re: Performance Question

2016-06-15 Thread Dan Burkert
When range partitioning, partition splits are added via the CreateTableOptions.addSplitRow method. You can find more about the different partitioning options in the schema design
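For illustration, a sketch of pre-splitting a range-partitioned table at creation time with the Kudu Java client (callable from Scala) is below; the schema, split values, and master address are made up, and on 0.9.x the classes live under org.kududb.* rather than org.apache.kudu.*:

    import scala.collection.JavaConverters._
    import org.apache.kudu.{ColumnSchema, Schema, Type}
    import org.apache.kudu.client.{CreateTableOptions, KuduClient}

    val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()

    val schema = new Schema(List(
      new ColumnSchema.ColumnSchemaBuilder("id", Type.STRING).key(true).build(),
      new ColumnSchema.ColumnSchemaBuilder("event_type", Type.STRING).build()
    ).asJava)

    val options = new CreateTableOptions()
    options.setRangePartitionColumns(List("id").asJava)

    // One addSplitRow call per desired tablet boundary within the key space.
    for (boundary <- Seq("4", "8", "c")) {
      val split = schema.newPartialRow()
      split.addString("id", boundary)
      options.addSplitRow(split)
    }

    client.createTable("ad_events", schema, options)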

Re: Performance Question

2016-06-15 Thread Benjamin Kim
Todd, I think the locality is not possible within our setup. We have the compute cluster with Spark, YARN, etc. on its own, and we have the storage cluster with HBase, Kudu, etc. on another. We beefed up the hardware specs on the compute cluster and beefed up storage capacity on the storage cluster. We

Re: Performance Question

2016-06-15 Thread Todd Lipcon
Awesome use case. One thing to keep in mind is that Spark parallelism will be limited by the number of tablets. So, you might want to split into 10 or so buckets per node to get the best query throughput. Usually, if you run top on some machines while running the query, you can see if it is fully
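As a sketch of that advice (the bucket count and column name are illustrative only): hash-partitioning the table into roughly 10 buckets per tablet server raises the tablet count, and with it the number of parallel Spark scan tasks. Using the same CreateTableOptions as in the creation sketch above:

    import scala.collection.JavaConverters._
    import org.apache.kudu.client.CreateTableOptions

    // ~10 buckets per node on a 15-node cluster => 150 tablets => up to 150 parallel scan tasks.
    val options = new CreateTableOptions()
      .addHashPartitions(List("id").asJava, 150)
      .setNumReplicas(3)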

Re: Performance Question

2016-06-15 Thread Benjamin Kim
Hi Todd, I did a simple test of our ad events. We stream using Spark Streaming directly into HBase, and the Data Analysts/Scientists do some insight/discovery work plus some report generation. For the reports, we use SQL, and for the deeper stuff, we use Spark. In Spark, our main data

Re: Performance Question

2016-06-15 Thread Todd Lipcon
Hi Benjamin, What workload are you using for benchmarks? Using Spark or something more custom? RDD or DataFrame or SQL, etc.? Maybe you can share the schema and some queries. Todd On Jun 15, 2016 8:10 AM, "Benjamin Kim" wrote: > Hi Todd, > > Now that Kudu 0.9.0 is out.

Re: Performance Question

2016-06-15 Thread Benjamin Kim
Hi Todd, Now that Kudu 0.9.0 is out, I have done some tests. Already, I am impressed. Compared to HBase, read and write performance are better. Write performance has the greatest improvement (> 4x), while read is > 1.5x. Albeit, these are only preliminary tests. Do you know of a way to really

Re: Performance Question

2016-05-28 Thread Benjamin Kim
Todd, It sounds like Kudu can possibly top or match those numbers put out by Aerospike. Do you have any performance statistics published, or any instructions on how to measure them myself as a good way to test? In addition, this will be a test using Spark, so should I wait for Kudu version 0.9.0

Re: Performance Question

2016-05-27 Thread Todd Lipcon
On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim wrote: > Hi Mike, > > First of all, thanks for the link. It looks like an interesting read. I > checked that Aerospike is currently at version 3.8.2.3, and in the article, > they are evaluating version 3.5.4. The main thing that

Re: Performance Question

2016-05-27 Thread Benjamin Kim
Hi Mike, First of all, thanks for the link. It looks like an interesting read. I checked that Aerospike is currently at version 3.8.2.3, and in the article, they are evaluating version 3.5.4. The main thing that impressed me was their claim that they can beat Cassandra and HBase by 8x for

Re: Performance Question

2016-05-27 Thread Mike Percy
Have you considered whether you have a scan-heavy or a random-access-heavy workload? Have you considered whether you always access/update a whole row vs. only a partial row? Kudu is a column store, so it has some awesome performance characteristics when you are doing a lot of scanning of just a
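To make the column-store point concrete, here is a small sketch of the scan-heavy, few-columns case (table, column names, and master address are hypothetical; assumes a spark-shell session with the kudu-spark connector on the classpath):

    val events = sqlContext.read
      .format("org.apache.kudu.spark.kudu")
      .options(Map("kudu.master" -> "kudu-master:7051", "kudu.table" -> "ad_events"))
      .load()

    // Only the projected and filtered columns need to be read, not the whole row.
    events.select("event_type", "event_ts")
      .where("event_ts > 1468800000")
      .groupBy("event_type")
      .count()
      .show()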