Hi Todd,

I changed the key to what you suggested, and I can’t tell the difference since it was already fast. But I did get more numbers.
> 104M rows in Kudu table
- read: 8s
- count: 16s
- aggregate: 9s

Reads took much longer, going from 0.2s to 8s; counts stayed the same at 16s; and aggregate queries look slower, going from 6s to 9s. I’m still impressed.

Cheers,
Ben

> On Jun 15, 2016, at 12:47 AM, Todd Lipcon <t...@cloudera.com> wrote:
>
> Hi Benjamin,
>
> What workload are you using for benchmarks? Using Spark or something more
> custom? RDD or data frame or SQL, etc.? Maybe you can share the schema and
> some queries.
>
> Todd
>
> Todd
>
> On Jun 15, 2016 8:10 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
> Hi Todd,
>
> Now that Kudu 0.9.0 is out, I have done some tests. Already, I am impressed.
> Compared to HBase, read and write performance are better. Write performance
> has the greatest improvement (> 4x), while read is > 1.5x. Albeit, these are
> only preliminary tests. Do you know of a way to really do some conclusive
> tests? I want to see if I can match your results on my 50 node cluster.
>
> Thanks,
> Ben
>
>> On May 30, 2016, at 10:33 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>
>> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>> Todd,
>>
>> It sounds like Kudu can possibly top or match those numbers put out by
>> Aerospike. Do you have any performance statistics published or any
>> instructions on how to measure them myself as a good way to test? In
>> addition, this will be a test using Spark, so should I wait for Kudu
>> version 0.9.0 where support will be built in?
>>
>> We don't have a lot of benchmarks published yet, especially on the write
>> side.
>> I've found that thorough cross-system benchmarks are very difficult to
>> do fairly and accurately, and often times users end up misguided if they pay
>> too much attention to them :) So, given a finite number of developers
>> working on Kudu, I think we've tended to spend more time on the project
>> itself and less time focusing on "competition". I'm sure there are use cases
>> where Kudu will beat out Aerospike, and probably use cases where Aerospike
>> will beat Kudu as well.
>>
>> From my perspective, it would be great if you can share some details of your
>> workload, especially if there are some areas you're finding Kudu lacking.
>> Maybe we can spot some easy code changes we could make to improve
>> performance, or suggest a tuning variable you could change.
>>
>> -Todd
>>
>>
>>> On May 27, 2016, at 9:19 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>>
>>> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>> Hi Mike,
>>>
>>> First of all, thanks for the link. It looks like an interesting read. I
>>> checked that Aerospike is currently at version 3.8.2.3, and in the article,
>>> they are evaluating version 3.5.4. The main thing that impressed me was
>>> their claim that they can beat Cassandra and HBase by 8x for writing and
>>> 25x for reading. Their big claim to fame is that Aerospike can write 1M
>>> records per second with only 50 nodes. I wanted to see if this is real.
>>>
>>> 1M records per second on 50 nodes is pretty doable by Kudu as well,
>>> depending on the size of your records and the insertion order. I've been
>>> playing with a ~70 node cluster recently and seen 1M+ writes/second
>>> sustained, and bursting above 4M. These are 1KB rows with 11 columns, and
>>> with pretty old HDD-only nodes. I think newer flash-based nodes could do
>>> better.
>>>
>>>
>>> To answer your questions, we have a DMP with user profiles with many
>>> attributes.
>>> We create segmentation information off of these attributes to
>>> classify them. Then, we can target advertising appropriately for our sales
>>> department. Much of the data processing is for applying models on all, or
>>> at least most, of every profile’s attributes to find similarities (nearest
>>> neighbor/clustering) over a large number of rows when batch processing or a
>>> small subset of rows for quick online scoring. So, our use case is a
>>> typical advanced analytics scenario. We have tried HBase, but it doesn’t
>>> work well for these types of analytics.
>>>
>>> I read in the release notes that Aerospike made many improvements
>>> for batch and scan operations.
>>>
>>> I wonder what your thoughts are for using Kudu for this.
>>>
>>> Sounds like a good Kudu use case to me. I've heard great things about
>>> Aerospike for the low latency random access portion, but I've also heard
>>> that it's _very_ expensive, and not particularly suited to the columnar
>>> scan workload. Lastly, I think the Apache license of Kudu is much more
>>> appealing than the AGPL3 used by Aerospike. But, that's not really a direct
>>> answer to the performance question :)
>>>
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>>> On May 27, 2016, at 6:21 PM, Mike Percy <mpe...@cloudera.com> wrote:
>>>>
>>>> Have you considered whether you have a scan heavy or a random access heavy
>>>> workload? Have you considered whether you always access / update a whole
>>>> row vs only a partial row? Kudu is a column store so has some awesome
>>>> performance characteristics when you are doing a lot of scanning of just a
>>>> couple of columns.
>>>>
>>>> I don't know the answer to your question but if your concern is
>>>> performance then I would be interested in seeing comparisons from a perf
>>>> perspective on certain workloads.
>>>>
>>>> Finally, a year ago Aerospike did quite poorly in a Jepsen test:
>>>> https://aphyr.com/posts/324-jepsen-aerospike
>>>>
>>>> I wonder if they have addressed any of those issues.
>>>>
>>>> Mike
>>>>
>>>> On Friday, May 27, 2016, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>> I am just curious. How will Kudu compare with Aerospike
>>>> (http://www.aerospike.com)? I went to a Spark
>>>> Roadshow and found out about this piece of software. It appears to fit our
>>>> use case perfectly since we are an ad-tech company trying to leverage our
>>>> user profiles data. Plus, it already has a Spark connector and has a
>>>> SQL-like client. The tables can be accessed using Spark SQL DataFrames
>>>> and, also, made into SQL tables for direct use with Spark SQL ODBC/JDBC
>>>> Thriftserver. I see from the work done here
>>>> http://gerrit.cloudera.org:8080/#/c/2992/ that the Spark integration is
>>>> well underway and, from the looks of it lately, almost complete. I would
>>>> prefer to use Kudu since we are already a Cloudera shop, and Kudu is easy
>>>> to deploy and configure using Cloudera Manager. I also hope that some of
>>>> Aerospike’s speed optimization techniques can make it into Kudu in the
>>>> future, if they have not already been thought of or included.
>>>>
>>>> Just some thoughts…
>>>>
>>>> Cheers,
>>>> Ben
>>>>
>>>>
>>>> --
>>>> --
>>>> Mike Percy
>>>> Software Engineer, Cloudera
>>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera