On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim <bbuil...@gmail.com> wrote:

> Hi Mike,
>
> First of all, thanks for the link. It looks like an interesting read. I
> checked that Aerospike is currently at version 3.8.2.3, and in the article,
> they are evaluating version 3.5.4. The main thing that impressed me was
> their claim that they can beat Cassandra and HBase by 8x for writing and
> 25x for reading. Their big claim to fame is that Aerospike can write 1M
> records per second with only 50 nodes. I wanted to see if this is real.
>

1M records per second on 50 nodes is pretty doable by Kudu as well,
depending on the size of your records and the insertion order. I've been
playing with a ~70 node cluster recently and seen 1M+ writes/second
sustained, and bursting above 4M. These are 1KB rows with 11 columns, and
with pretty old HDD-only nodes. I think newer flash-based nodes could do
better.

>
> To answer your questions, we have a DMP with user profiles with many
> attributes. We create segmentation information off of these attributes to
> classify them. Then, we can target advertising appropriately for our sales
> department. Much of the data processing is for applying models on all or if
> not most of every profile’s attributes to find similarities (nearest
> neighbor/clustering) over a large number of rows when batch processing or a
> small subset of rows for quick online scoring. So, our use case is a
> typical advanced analytics scenario. We have tried HBase, but it doesn’t
> work well for these types of analytics.
>
> I read, that Aerospike in the release notes, they did do many improvements
> for batch and scan operations.
>
> I wonder what your thoughts are for using Kudu for this.
>

Sounds like a good Kudu use case to me. I've heard great things about
Aerospike for the low latency random access portion, but I've also heard
that it's _very_ expensive, and not particularly suited to the columnar
scan workload. Lastly, I think the Apache license of Kudu is much more
appealing than the AGPL3 used by Aerospike.
But, that's not really a direct answer to the performance question :)

>
> Thanks,
> Ben
>
>
> On May 27, 2016, at 6:21 PM, Mike Percy <mpe...@cloudera.com> wrote:
>
> Have you considered whether you have a scan heavy or a random access heavy
> workload? Have you considered whether you always access / update a whole
> row vs only a partial row? Kudu is a column store so has some
> awesome performance characteristics when you are doing a lot of scanning of
> just a couple of columns.
>
> I don't know the answer to your question but if your concern is
> performance then I would be interested in seeing comparisons from a perf
> perspective on certain workloads.
>
> Finally, a year ago Aerospike did quite poorly in a Jepsen test:
> https://aphyr.com/posts/324-jepsen-aerospike
>
> I wonder if they have addressed any of those issues.
>
> Mike
>
> On Friday, May 27, 2016, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> I am just curious. How will Kudu compare with Aerospike (
>> http://www.aerospike.com)? I went to a Spark Roadshow and found out
>> about this piece of software. It appears to fit our use case perfectly
>> since we are an ad-tech company trying to leverage our user profiles data.
>> Plus, it already has a Spark connector and has a SQL-like client. The
>> tables can be accessed using Spark SQL DataFrames and, also, made into SQL
>> tables for direct use with Spark SQL ODBC/JDBC Thriftserver. I see from the
>> work done here http://gerrit.cloudera.org:8080/#/c/2992/ that the Spark
>> integration is well underway and, from the looks of it lately, almost
>> complete. I would prefer to use Kudu since we are already a Cloudera shop,
>> and Kudu is easy to deploy and configure using Cloudera Manager. I also
>> hope that some of Aerospike’s speed optimization techniques can make it
>> into Kudu in the future, if they have not been already thought of or
>> included.
>>
>> Just some thoughts…
>>
>> Cheers,
>> Ben
>
>
> --
> Mike Percy
> Software Engineer, Cloudera
>

--
Todd Lipcon
Software Engineer, Cloudera