Re: Performance Question

Benjamin Kim Thu, 30 Jun 2016 17:40:03 -0700

Hi Todd,

I changed the key to what you suggested, but I can’t tell the difference 
since it was already fast. I did get more numbers, though.


> 104M rows in Kudu table
- read: 8s
- count: 16s
- aggregate: 9s

Read time went up considerably, from 0.2s to 8s; counts stayed the same at 
16s; and aggregate queries took longer, going from 6s to 9s.
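
To put those changes in proportion, here is a minimal sketch that works out the ratios from the timings quoted above. The variable and function names are only illustrative, and the rows-per-second figure assumes the count touches all 104M rows.

```python
# Timings in seconds from this message: (before key change, after key change).
timings = {
    "read": (0.2, 8.0),
    "count": (16.0, 16.0),
    "aggregate": (6.0, 9.0),
}

def slowdown(before, after):
    """Return how many times longer the operation takes after the change."""
    return after / before

ratios = {op: slowdown(before, after) for op, (before, after) in timings.items()}

# Rough count throughput, assuming the count scans every row in the table.
total_rows = 104_000_000
count_rows_per_sec = total_rows / timings["count"][1]

for op, ratio in ratios.items():
    print(f"{op}: {ratio:.1f}x")
print(f"count throughput: {count_rows_per_sec:,.0f} rows/s")
```

Single runs like these are noisy, so the ratios are best read as rough indications rather than measurements.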

I’m still impressed.

Cheers,
Ben 

> On Jun 15, 2016, at 12:47 AM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> Hi Benjamin,
> 
> What workload are you using for benchmarks? Are you using Spark or 
> something more custom? RDD, DataFrame, SQL, etc.? Maybe you can share the 
> schema and some queries.
> 
> Todd
> 
> On Jun 15, 2016 8:10 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
> Hi Todd,
> 
> Now that Kudu 0.9.0 is out, I have done some tests. Already, I am impressed. 
> Compared to HBase, read and write performance are better. Write performance 
> has the greatest improvement (> 4x), while read is > 1.5x. These are only 
> preliminary tests, though. Do you know of a way to run more conclusive 
> tests? I want to see if I can match your results on my 50 node cluster.
> 
> Thanks,
> Ben
> 
>> On May 30, 2016, at 10:33 AM, Todd Lipcon <t...@cloudera.com> wrote:
>> 
>> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>> Todd,
>> 
>> It sounds like Kudu can possibly top or match those numbers put out by 
>> Aerospike. Do you have any performance statistics published, or any 
>> instructions on how to measure them myself as a good way to test? In 
>> addition, this will be a test using Spark, so should I wait for Kudu 
>> version 0.9.0, where support will be built in?
>> 
>> We don't have a lot of benchmarks published yet, especially on the write 
>> side. I've found that thorough cross-system benchmarks are very difficult to 
>> do fairly and accurately, and often times users end up misguided if they pay 
>> too much attention to them :) So, given a finite number of developers 
>> working on Kudu, I think we've tended to spend more time on the project 
>> itself and less time focusing on "competition". I'm sure there are use cases 
>> where Kudu will beat out Aerospike, and probably use cases where Aerospike 
>> will beat Kudu as well.
>> 
>> From my perspective, it would be great if you can share some details of your 
>> workload, especially if there are some areas you're finding Kudu lacking. 
>> Maybe we can spot some easy code changes we could make to improve 
>> performance, or suggest a tuning variable you could change.
>> 
>> -Todd
>> 
>> 
>>> On May 27, 2016, at 9:19 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>> 
>>> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>> Hi Mike,
>>> 
>>> First of all, thanks for the link. It looks like an interesting read. I 
>>> checked that Aerospike is currently at version 3.8.2.3, and in the article, 
>>> they are evaluating version 3.5.4. The main thing that impressed me was 
>>> their claim that they can beat Cassandra and HBase by 8x for writing and 
>>> 25x for reading. Their big claim to fame is that Aerospike can write 1M 
>>> records per second with only 50 nodes. I wanted to see if this is real.
>>> 
>>> 1M records per second on 50 nodes is pretty doable by Kudu as well, 
>>> depending on the size of your records and the insertion order. I've been 
>>> playing with a ~70 node cluster recently and seen 1M+ writes/second 
>>> sustained, and bursting above 4M. These are 1KB rows with 11 columns, and 
>>> with pretty old HDD-only nodes. I think newer flash-based nodes could do 
>>> better.
>>>  
>>> 
>>> To answer your questions, we have a DMP with user profiles with many 
>>> attributes. We create segmentation information from these attributes to 
>>> classify them. Then, we can target advertising appropriately for our sales 
>>> department. Much of the data processing applies models to all, or at 
>>> least most, of each profile’s attributes to find similarities (nearest 
>>> neighbor/clustering), either over a large number of rows in batch 
>>> processing or over a small subset of rows for quick online scoring. So, 
>>> our use case is a typical advanced analytics scenario. We have tried 
>>> HBase, but it doesn’t work well for these types of analytics.
>>> 
>>> I read in the Aerospike release notes that they made many improvements 
>>> to batch and scan operations.
>>> 
>>> I wonder what your thoughts are for using Kudu for this.
>>> 
>>> Sounds like a good Kudu use case to me. I've heard great things about 
>>> Aerospike for the low latency random access portion, but I've also heard 
>>> that it's _very_ expensive, and not particularly suited to the columnar 
>>> scan workload. Lastly, I think the Apache license of Kudu is much more 
>>> appealing than the AGPL3 used by Aerospike. But, that's not really a direct 
>>> answer to the performance question :)
>>>  
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>>> On May 27, 2016, at 6:21 PM, Mike Percy <mpe...@cloudera.com> wrote:
>>>> 
>>>> Have you considered whether you have a scan heavy or a random access heavy 
>>>> workload? Have you considered whether you always access / update a whole 
>>>> row vs only a partial row? Kudu is a column store so has some awesome 
>>>> performance characteristics when you are doing a lot of scanning of just a 
>>>> couple of columns.
>>>> 
>>>> I don't know the answer to your question but if your concern is 
>>>> performance then I would be interested in seeing comparisons from a perf 
>>>> perspective on certain workloads.
>>>> 
>>>> Finally, a year ago Aerospike did quite poorly in a Jepsen test: 
>>>> https://aphyr.com/posts/324-jepsen-aerospike
>>>> 
>>>> I wonder if they have addressed any of those issues.
>>>> 
>>>> Mike
>>>> 
>>>> On Friday, May 27, 2016, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>> I am just curious. How will Kudu compare with Aerospike 
>>>> (http://www.aerospike.com)? I went to a Spark Roadshow and found out 
>>>> about this piece of software. It appears to fit our use case perfectly 
>>>> since we are an ad-tech company trying to leverage our user profile 
>>>> data. Plus, it already has a Spark connector and a SQL-like client. The 
>>>> tables can be accessed using Spark SQL DataFrames and also made into SQL 
>>>> tables for direct use with the Spark SQL ODBC/JDBC Thriftserver. I see 
>>>> from the work done at http://gerrit.cloudera.org:8080/#/c/2992/ that the 
>>>> Spark integration is well underway and, from the looks of it lately, 
>>>> almost complete. I would prefer to use Kudu since we are already a 
>>>> Cloudera shop, and Kudu is easy to deploy and configure using Cloudera 
>>>> Manager. I also hope that some of Aerospike’s speed optimization 
>>>> techniques can make it into Kudu in the future, if they have not already 
>>>> been thought of or included.
>>>> 
>>>> Just some thoughts…
>>>> 
>>>> Cheers,
>>>> Ben
>>>> 
>>>> 
>>>> -- 
>>>> Mike Percy
>>>> Software Engineer, Cloudera
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>> 
>> 
>> 
>> 
>> -- 
>> Todd Lipcon
>> Software Engineer, Cloudera
>
