Re: Performance Question

Dan Burkert Wed, 06 Jul 2016 09:12:15 -0700

On Mon, Jul 4, 2016 at 2:46 AM, 袁康（梓悠） <yuankang...@alibaba-inc.com> wrote:


> How can I delete data in a Kudu table with Spark (not delete the table
> itself)?
>

We do not currently have a way to delete a Kudu table through the spark
connector, but you should be able to instantiate a Kudu client and delete
the table that way.  We have discussed making one of the spark write modes
do a truncate operation, but nothing has been implemented.

 - Dan
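
For readers who want to remove rows while keeping the table, the client-side route mentioned above could look roughly like the sketch below. It assumes the kudu-python client (`kudu.connect`, `client.table`, `client.new_session`, `table.new_delete`, `session.apply`, `session.flush`, `client.delete_table`). The host name, table name, and the `key` column are hypothetical placeholders, and a table with a composite primary key would need every key column in the dict. This is an illustration, not the connector's own API.

```python
def delete_rows(client, table_name, keys, key_column="key"):
    """Delete rows by primary key; the table itself is left in place.

    `client` is expected to behave like a kudu-python client object.
    `key_column` is a hypothetical single-column primary key name.
    """
    table = client.table(table_name)
    session = client.new_session()
    deleted = 0
    for key in keys:
        # One delete operation per primary key value.
        session.apply(table.new_delete({key_column: key}))
        deleted += 1
    # Operations are buffered in the session until flushed.
    session.flush()
    return deleted


def drop_table(client, table_name):
    """Drop the whole table (schema and data), as opposed to deleting rows."""
    client.delete_table(table_name)


# Usage against a real cluster (placeholder host and table names):
#   import kudu
#   client = kudu.connect(host="kudu-master.example.com", port=7051)
#   delete_rows(client, "example_table", [1, 2, 3])
```

Deleting by key requires knowing the keys, so emptying a whole table this way would first need a scan of the key columns. Dropping and recreating the table is the blunter alternative.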


> ------------------------------------------------------------------
> From: Todd Lipcon <t...@cloudera.com>
> Sent: Saturday, July 2, 2016 02:44
> To: user <user@kudu.incubator.apache.org>
> Subject: Re: Performance Question
>
> On Thu, Jun 30, 2016 at 5:39 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> Hi Todd,
>
> I changed the key to be what you suggested, and I can’t tell the
> difference since it was already fast. But, I did get more numbers.
>
> Yea, you won't see a substantial difference until you're inserting
> billions of rows, etc, and the keys and/or bloom filters no longer fit in
> cache.
>
>
> > 104M rows in Kudu table
> - read: 8s
> - count: 16s
> - aggregate: 9s
>
> The time to read took much longer, from 0.2s to 8s; counts were the same at
> 16s; and aggregate queries took longer, from 6s to 9s.
>
> I’m still impressed.
>
> We aim to please ;-) If you have any interest in writing up these
> experiments as a blog post, would be cool to post them for others to learn
> from.
>
> -Todd
>
> On Jun 15, 2016, at 12:47 AM, Todd Lipcon <t...@cloudera.com> wrote:
>
> Hi Benjamin,
>
> What workload are you using for benchmarks? Using spark or something more
> custom? rdd or data frame or SQL, etc? Maybe you can share the schema and
> some queries
>
> Todd
> On Jun 15, 2016 8:10 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
> Hi Todd,
>
> Now that Kudu 0.9.0 is out, I have done some tests. Already, I am
> impressed. Compared to HBase, read and write performance are better. Write
> performance has the greatest improvement (> 4x), while read is > 1.5x.
> Albeit, these are only preliminary tests. Do you know of a way to really do
> some conclusive tests? I want to see if I can match your results on my 50
> node cluster.
>
> Thanks,
> Ben
>
> On May 30, 2016, at 10:33 AM, Todd Lipcon <t...@cloudera.com> wrote:
>
> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
> Todd,
>
> It sounds like Kudu can possibly top or match those numbers put out by
> Aerospike. Do you have any performance statistics published or any
> instructions on how to measure them myself as a good way to test? In addition,
> this will be a test using Spark, so should I wait for Kudu version 0.9.0
> where support will be built in?
>
> We don't have a lot of benchmarks published yet, especially on the write
> side. I've found that thorough cross-system benchmarks are very difficult
> to do fairly and accurately, and often times users end up misguided if they
> pay too much attention to them :) So, given a finite number of developers
> working on Kudu, I think we've tended to spend more time on the project
> itself and less time focusing on "competition". I'm sure there are use
> cases where Kudu will beat out Aerospike, and probably use cases where
> Aerospike will beat Kudu as well.
>
> From my perspective, it would be great if you can share some details of
> your workload, especially if there are some areas you're finding Kudu
> lacking. Maybe we can spot some easy code changes we could make to improve
> performance, or suggest a tuning variable you could change.
>
> -Todd
>
>
> On May 27, 2016, at 9:19 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> Hi Mike,
>
> First of all, thanks for the link. It looks like an interesting read. I
> checked that Aerospike is currently at version 3.8.2.3, and in the article,
> they are evaluating version 3.5.4. The main thing that impressed me was
> their claim that they can beat Cassandra and HBase by 8x for writing and
> 25x for reading. Their big claim to fame is that Aerospike can write 1M
> records per second with only 50 nodes. I wanted to see if this is real.
>
> 1M records per second on 50 nodes is pretty doable by Kudu as well,
> depending on the size of your records and the insertion order. I've been
> playing with a ~70 node cluster recently and seen 1M+ writes/second
> sustained, and bursting above 4M. These are 1KB rows with 11 columns, and
> with pretty old HDD-only nodes. I think newer flash-based nodes could do
> better.
>
>
> To answer your questions, we have a DMP with user profiles with many
> attributes. We create segmentation information off of these attributes to
> classify them. Then, we can target advertising appropriately for our sales
> department. Much of the data processing is for applying models on all, or at
> least most, of every profile’s attributes to find similarities (nearest
> neighbor/clustering) over a large number of rows when batch processing or a
> small subset of rows for quick online scoring. So, our use case is a
> typical advanced analytics scenario. We have tried HBase, but it doesn’t
> work well for these types of analytics.
>
> I read in the Aerospike release notes that they made many improvements
> for batch and scan operations.
>
> I wonder what your thoughts are for using Kudu for this.
>
> Sounds like a good Kudu use case to me. I've heard great things about
> Aerospike for the low latency random access portion, but I've also heard
> that it's _very_ expensive, and not particularly suited to the columnar
> scan workload. Lastly, I think the Apache license of Kudu is much more
> appealing than the AGPL3 used by Aerospike. But, that's not really a direct
> answer to the performance question :)
>
>
> Thanks,
> Ben
>
>
> On May 27, 2016, at 6:21 PM, Mike Percy <mpe...@cloudera.com> wrote:
>
> Have you considered whether you have a scan heavy or a random access heavy
> workload? Have you considered whether you always access / update a whole
> row vs only a partial row? Kudu is a column store so has some
> awesome performance characteristics when you are doing a lot of scanning of
> just a couple of columns.
>
> I don't know the answer to your question but if your concern is
> performance then I would be interested in seeing comparisons from a perf
> perspective on certain workloads.
>
> Finally, a year ago Aerospike did quite poorly in a Jepsen test:
> https://aphyr.com/posts/324-jepsen-aerospike
>
> I wonder if they have addressed any of those issues.
>
> Mike
>
> On Friday, May 27, 2016, Benjamin Kim <bbuil...@gmail.com> wrote:
> I am just curious. How will Kudu compare with Aerospike (
> http://www.aerospike.com)? I went to a Spark Roadshow and found out about
> this piece of software. It appears to fit our use case perfectly since we
> are an ad-tech company trying to leverage our user profiles data. Plus, it
> already has a Spark connector and has a SQL-like client. The tables can be
> accessed using Spark SQL DataFrames and, also, made into SQL tables for
> direct use with Spark SQL ODBC/JDBC Thriftserver. I see from the work done
> here http://gerrit.cloudera.org:8080/#/c/2992/ that the Spark integration
> is well underway and, from the looks of it lately, almost complete. I would
> prefer to use Kudu since we are already a Cloudera shop, and Kudu is easy
> to deploy and configure using Cloudera Manager. I also hope that some of
> Aerospike’s speed optimization techniques can make it into Kudu in the
> future, if they have not been already thought of or included.
>
> Just some thoughts…
>
> Cheers,
> Ben
>
>
> --
> Mike Percy
> Software Engineer, Cloudera
>
>
>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
