Re: Performance Question

Benjamin Kim Tue, 14 Jun 2016 23:10:47 -0700

Hi Todd,

Now that Kudu 0.9.0 is out, I have done some tests. Already, I am impressed. 
Compared to HBase, read and write performance are both better. Write performance 
shows the greatest improvement (> 4x), while read is > 1.5x. These are only 
preliminary tests, though. Do you know of a way to run more conclusive tests? I 
want to see if I can match your results on my 50 node cluster.
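
One way to make a comparison like this more conclusive is to repeat each workload several times and compare medians rather than single runs. The following is only a minimal sketch under stated assumptions: it does not call any Kudu or HBase API, the workload passed in is a hypothetical callable that would wrap the real write or scan job, and the function names are made up for illustration.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_trials(workload: Callable[[], None], trials: int = 5,
                warmup: int = 1) -> List[float]:
    """Run `workload` repeatedly and return wall-clock seconds per timed trial."""
    for _ in range(warmup):
        workload()  # warm-up runs are executed but not timed
    durations = []
    for _ in range(trials):
        start = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float], rows: int) -> Dict[str, float]:
    """Reduce trial durations to a median-based throughput summary."""
    median = statistics.median(durations)
    return {
        "median_s": median,
        "min_s": min(durations),
        "max_s": max(durations),
        "rows_per_s": rows / median,
    }


def speedup(baseline: Dict[str, float], candidate: Dict[str, float]) -> float:
    """Throughput ratio of candidate over baseline (e.g. Kudu over HBase)."""
    return candidate["rows_per_s"] / baseline["rows_per_s"]
```

For the ratio to mean anything, both systems would need the same data, row size, and insertion order, since record size and insertion order affect write throughput.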


Thanks,
Ben

> On May 30, 2016, at 10:33 AM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
> Todd,
> 
> It sounds like Kudu can possibly top or match those numbers put out by 
> Aerospike. Do you have any performance statistics published, or any 
> instructions on how to measure them myself as a good way to test? In addition, 
> this will be a test using Spark, so should I wait for Kudu version 0.9.0, where 
> support will be built in?
> 
> We don't have a lot of benchmarks published yet, especially on the write 
> side. I've found that thorough cross-system benchmarks are very difficult to 
> do fairly and accurately, and often times users end up misguided if they pay 
> too much attention to them :) So, given a finite number of developers working 
> on Kudu, I think we've tended to spend more time on the project itself and 
> less time focusing on "competition". I'm sure there are use cases where Kudu 
> will beat out Aerospike, and probably use cases where Aerospike will beat 
> Kudu as well.
> 
> From my perspective, it would be great if you can share some details of your 
> workload, especially if there are some areas you're finding Kudu lacking. 
> Maybe we can spot some easy code changes we could make to improve 
> performance, or suggest a tuning variable you could change.
> 
> -Todd
> 
> 
>> On May 27, 2016, at 9:19 PM, Todd Lipcon <t...@cloudera.com> wrote:
>> 
>> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>> Hi Mike,
>> 
>> First of all, thanks for the link. It looks like an interesting read. I 
>> checked that Aerospike is currently at version 3.8.2.3, and in the article, 
>> they are evaluating version 3.5.4. The main thing that impressed me was 
>> their claim that they can beat Cassandra and HBase by 8x for writing and 25x 
>> for reading. Their big claim to fame is that Aerospike can write 1M records 
>> per second with only 50 nodes. I wanted to see if this is real.
>> 
>> 1M records per second on 50 nodes is pretty doable by Kudu as well, 
>> depending on the size of your records and the insertion order. I've been 
>> playing with a ~70 node cluster recently and seen 1M+ writes/second 
>> sustained, and bursting above 4M. These are 1KB rows with 11 columns, and 
>> with pretty old HDD-only nodes. I think newer flash-based nodes could do 
>> better.
>>  
>> 
>> To answer your questions, we have a DMP with user profiles with many 
>> attributes. We create segmentation information off of these attributes to 
>> classify them. Then, we can target advertising appropriately for our sales 
>> department. Much of the data processing applies models to all, or at least 
>> most, of each profile’s attributes to find similarities (nearest 
>> neighbor/clustering), either over a large number of rows when batch 
>> processing or over a small subset of rows for quick online scoring. So, our 
>> use case is a typical advanced analytics scenario. We have tried HBase, but 
>> it doesn’t work well for these types of analytics.
>> 
>> I read in the Aerospike release notes that they made many improvements to 
>> batch and scan operations.
>> 
>> I wonder what your thoughts are for using Kudu for this.
>> 
>> Sounds like a good Kudu use case to me. I've heard great things about 
>> Aerospike for the low latency random access portion, but I've also heard 
>> that it's _very_ expensive, and not particularly suited to the columnar scan 
>> workload. Lastly, I think the Apache license of Kudu is much more appealing 
>> than the AGPL3 used by Aerospike. But, that's not really a direct answer to 
>> the performance question :)
>>  
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On May 27, 2016, at 6:21 PM, Mike Percy <mpe...@cloudera.com> wrote:
>>> 
>>> Have you considered whether you have a scan heavy or a random access heavy 
>>> workload? Have you considered whether you always access / update a whole 
>>> row vs only a partial row? Kudu is a column store so has some awesome 
>>> performance characteristics when you are doing a lot of scanning of just a 
>>> couple of columns.
>>> 
>>> I don't know the answer to your question but if your concern is performance 
>>> then I would be interested in seeing comparisons from a perf perspective on 
>>> certain workloads.
>>> 
>>> Finally, a year ago Aerospike did quite poorly in a Jepsen test: 
>>> https://aphyr.com/posts/324-jepsen-aerospike
>>> 
>>> I wonder if they have addressed any of those issues.
>>> 
>>> Mike
>>> 
>>> On Friday, May 27, 2016, Benjamin Kim <bbuil...@gmail.com> wrote:
>>> I am just curious. How will Kudu compare with Aerospike 
>>> (http://www.aerospike.com)? I went to a Spark Roadshow and found out about 
>>> this piece of software. It appears to fit our use case perfectly since we 
>>> are an ad-tech company trying to leverage our user profile data. Plus, it 
>>> already has a Spark connector and a SQL-like client. The tables can be 
>>> accessed using Spark SQL DataFrames and also made into SQL tables for 
>>> direct use with the Spark SQL ODBC/JDBC Thriftserver. I see from the work 
>>> done here http://gerrit.cloudera.org:8080/#/c/2992/ that the Spark 
>>> integration is well underway and, from the looks of it lately, almost 
>>> complete. I would prefer to use Kudu since we are already a Cloudera shop, 
>>> and Kudu is easy to deploy and configure using Cloudera Manager. I also 
>>> hope that some of Aerospike’s speed optimization techniques can make it 
>>> into Kudu in the future, if they have not already been thought of or 
>>> included.
>>> 
>>> Just some thoughts…
>>> 
>>> Cheers,
>>> Ben
>>> 
>>> 
>>> -- 
>>> --
>>> Mike Percy
>>> Software Engineer, Cloudera
>>> 
>>> 
>> 
>> 
>> 
>> 
>> -- 
>> Todd Lipcon
>> Software Engineer, Cloudera
> 
> 
> 
> 
> -- 
> Todd Lipcon
> Software Engineer, Cloudera
