Re: Performance Question

Benjamin Kim Wed, 29 Jun 2016 11:33:04 -0700

Todd,

I started Spark streaming more events into Kudu. Performance is great there 
too! With HBase, it’s fast too, but I noticed that it pauses here and there, 
making it take seconds for > 40k rows at a time, while Kudu doesn’t. The 
progress bar just blinks by. I will keep this running until it hits 1B rows and 
rerun my performance tests. This, hopefully, will give better numbers.


Thanks,
Ben


> On Jun 28, 2016, at 4:26 PM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> Cool, thanks for the report, Ben. For what it's worth, I think there's still 
> some low hanging fruit in the Spark connector for Kudu (for example, I 
> believe locality on reads is currently broken). So, you can expect 
> performance to continue to improve in future versions. I'd also be interested 
> to see results on Kudu for a much larger dataset - my guess is a lot of the 6 
> seconds you're seeing is constant overhead from Spark job setup, etc, given 
> that the performance doesn't seem to get slower as you went from 700K rows to 
> 13M rows.
> 
> -Todd
> 
> On Tue, Jun 28, 2016 at 3:03 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> FYI.
> 
> I did a quick-n-dirty performance test.
> 
> First, the setup:
> QA cluster:
> 15 data nodes
> 64GB memory each
> HBase is using 4GB of memory
> Kudu is using 1GB of memory
> 1 HBase/Kudu master node
> 64GB memory
> HBase/Kudu master is using 1GB of memory each
> 10Gb Ethernet
> 
> Using Spark on both to load/read events data (84 columns per row), I was able 
> to record performance for each. On the HBase side, I used the Phoenix 4.7 
> Spark plugin where DataFrames can be used directly. On the Kudu side, I used 
> the Spark connector. I created an events table in Phoenix using the CREATE 
> TABLE statement and created the equivalent in Kudu using the Spark method 
> based off of a DataFrame schema.
> 
> Here are the numbers for Phoenix/HBase.
> 1st run:
> > 715k rows
> - write: 2.7m
> 
> > 715k rows in HBase table
> - read: 0.1s
> - count: 3.8s
> - aggregate: 61s
> 
> 2nd run:
> > 5.2M rows
> - write: 11m
> * had 4 region servers go down, had to retry the 5.2M row write
> 
> > 5.9M rows in HBase table
> - read: 8s
> - count: 3m
> - aggregate: 46s
> 
> 3rd run:
> > 6.8M rows
> - write: 9.6m
> 
> > 12.7M rows
> - read: 10s
> - count: 3m
> - aggregate: 44s
> 
> 
> Here are the numbers for Kudu.
> 1st run:
> > 715k rows
> - write: 18s
> 
> > 715k rows in Kudu table
> - read: 0.2s
> - count: 18s
> - aggregate: 5s
> 
> 2nd run:
> > 5.2M rows
> - write: 33s
> 
> > 5.9M rows in Kudu table
> - read: 0.2s
> - count: 16s
> - aggregate: 6s
> 
> 3rd run:
> > 6.8M rows
> - write: 27s
> 
> > 12.7M rows in Kudu table
> - read: 0.2s
> - count: 16s
> - aggregate: 6s
> 
> The Kudu results are impressive if you take these number as-is. Kudu is close 
> to 18x faster at writing (UPSERT). Kudu is 30x faster at reading (HBase times 
> increase as data size grows).  Kudu is 7x faster at full row counts. Lastly, 
> Kudu is 3x faster doing an aggregate query (count distinct event_id’s per 
> user_id). *Remember that this is small cluster, times are still respectable 
> for both systems, HBase could have been configured better, and the HBase 
> table could have been better tuned.
> 
> Cheers,
> Ben
> 
> 
>> On Jun 15, 2016, at 10:13 AM, Dan Burkert <d...@cloudera.com 
>> <mailto:d...@cloudera.com>> wrote:
>> 
>> Adding partition splits when range partitioning is done via the 
>> CreateTableOptions.addSplitRow 
>> <http://getkudu.io/apidocs/org/kududb/client/CreateTableOptions.html#addSplitRow-org.kududb.client.PartialRow->
>>  method.  You can find more about the different partitioning options in the 
>> schema design guide 
>> <http://getkudu.io/docs/schema_design.html#data-distribution>.  We generally 
>> recommend sticking to hash partitioning if possible, since you don't have to 
>> determine your own split rows.
>> 
>> - Dan
>> 
>> On Wed, Jun 15, 2016 at 9:17 AM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Todd,
>> 
>> I think the locality is not within our setup. We have the compute cluster 
>> with Spark, YARN, etc. on its own, and we have the storage cluster with 
>> HBase, Kudu, etc. on another. We beefed up the hardware specs on the compute 
>> cluster and beefed up storage capacity on the storage cluster. We got this 
>> setup idea from the Databricks folks. I do have a question. I created the 
>> table to use range partition on columns. I see that if I use hash partition 
>> I can set the number of splits, but how do I do that using range (50 nodes * 
>> 10 = 500 splits)?
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On Jun 15, 2016, at 9:11 AM, Todd Lipcon <t...@cloudera.com 
>>> <mailto:t...@cloudera.com>> wrote:
>>> 
>>> Awesome use case. One thing to keep in mind is that spark parallelism will 
>>> be limited by the number of tablets. So, you might want to split into 10 or 
>>> so buckets per node to get the best query throughput.
>>> 
>>> Usually if you run top on some machines while running the query you can see 
>>> if it is fully utilizing the cores.
>>> 
>>> Another known issue right now is that spark locality isn't working properly 
>>> on replicated tables so you will use a lot of network traffic. For a perf 
>>> test you might want to try a table with replication count 1
>>> 
>>> On Jun 15, 2016 5:26 PM, "Benjamin Kim" <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> Hi Todd,
>>> 
>>> I did a simple test of our ad events. We stream using Spark Streaming 
>>> directly into HBase, and the Data Analysts/Scientists do some 
>>> insight/discovery work plus some reports generation. For the reports, we 
>>> use SQL, and the more deeper stuff, we use Spark. In Spark, our main data 
>>> currency store of choice is DataFrames.
>>> 
>>> The schema is around 83 columns wide where most are of the string data type.
>>> 
>>> "event_type", "timestamp", "event_valid", "event_subtype", "user_ip", 
>>> "user_id", "mappable_id",
>>> "cookie_status", "profile_status", "user_status", "previous_timestamp", 
>>> "user_agent", "referer",
>>> "host_domain", "uri", "request_elapsed", "browser_languages", "acamp_id", 
>>> "creative_id",
>>> "location_id", “pcamp_id",
>>> "pdomain_id", "continent_code", "country", "region", "dma", "city", "zip", 
>>> "isp", "line_speed",
>>> "gender", "year_of_birth", "behaviors_read", "behaviors_written", 
>>> "key_value_pairs", "acamp_candidates",
>>> "tag_format", "optimizer_name", "optimizer_version", "optimizer_ip", 
>>> "pixel_id", “video_id",
>>> "video_network_id", "video_time_watched", "video_percentage_watched", 
>>> "video_media_type",
>>> "video_player_iframed", "video_player_in_view", "video_player_width", 
>>> "video_player_height",
>>> "conversion_valid_sale", "conversion_sale_amount", 
>>> "conversion_commission_amount", "conversion_step",
>>> "conversion_currency", "conversion_attribution", "conversion_offer_id", 
>>> "custom_info", "frequency",
>>> "recency_seconds", "cost", "revenue", “optimizer_acamp_id",
>>> "optimizer_creative_id", "optimizer_ecpm", "impression_id", 
>>> "diagnostic_data",
>>> "user_profile_mapping_source", "latitude", "longitude", "area_code", 
>>> "gmt_offset", "in_dst",
>>> "proxy_type", "mobile_carrier", "pop", "hostname", "profile_expires", 
>>> "timestamp_iso", "reference_id",
>>> "identity_organization", "identity_method"
>>> 
>>> Most queries are like counts of how many users use what browser, how many 
>>> are unique users, etc. The part that scares most users is when it comes to 
>>> joining this data with other dimension/3rd party events tables because of 
>>> shear size of it.
>>> 
>>> We do what most companies do, similar to what I saw in earlier 
>>> presentations of Kudu. We dump data out of HBase into partitioned Parquet 
>>> tables to make query performance manageable.
>>> 
>>> I will coordinate with a data scientist today to do some tests. He is 
>>> working on identity matching/record linking of users from 2 domains: US and 
>>> Singapore, using probabilistic deduping algorithms. I will load the data 
>>> from ad events from both countries, and let him run his process against 
>>> this data in Kudu. I hope this will “wow” the team.
>>> 
>>> Thanks,
>>> Ben
>>> 
>>>> On Jun 15, 2016, at 12:47 AM, Todd Lipcon <t...@cloudera.com 
>>>> <mailto:t...@cloudera.com>> wrote:
>>>> 
>>>> Hi Benjamin,
>>>> 
>>>> What workload are you using for benchmarks? Using spark or something more 
>>>> custom? rdd or data frame or SQL, etc? Maybe you can share the schema and 
>>>> some queries
>>>> 
>>>> Todd
>>>> 
>>>> Todd
>>>> 
>>>> On Jun 15, 2016 8:10 AM, "Benjamin Kim" <bbuil...@gmail.com 
>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>> Hi Todd,
>>>> 
>>>> Now that Kudu 0.9.0 is out. I have done some tests. Already, I am 
>>>> impressed. Compared to HBase, read and write performance are better. Write 
>>>> performance has the greatest improvement (> 4x), while read is > 1.5x. 
>>>> Albeit, these are only preliminary tests. Do you know of a way to really 
>>>> do some conclusive tests? I want to see if I can match your results on my 
>>>> 50 node cluster.
>>>> 
>>>> Thanks,
>>>> Ben
>>>> 
>>>>> On May 30, 2016, at 10:33 AM, Todd Lipcon <t...@cloudera.com 
>>>>> <mailto:t...@cloudera.com>> wrote:
>>>>> 
>>>>> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim <bbuil...@gmail.com 
>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>> Todd,
>>>>> 
>>>>> It sounds like Kudu can possibly top or match those numbers put out by 
>>>>> Aerospike. Do you have any performance statistics published or any 
>>>>> instructions as to measure them myself as good way to test? In addition, 
>>>>> this will be a test using Spark, so should I wait for Kudu version 0.9.0 
>>>>> where support will be built in?
>>>>> 
>>>>> We don't have a lot of benchmarks published yet, especially on the write 
>>>>> side. I've found that thorough cross-system benchmarks are very difficult 
>>>>> to do fairly and accurately, and often times users end up misguided if 
>>>>> they pay too much attention to them :) So, given a finite number of 
>>>>> developers working on Kudu, I think we've tended to spend more time on 
>>>>> the project itself and less time focusing on "competition". I'm sure 
>>>>> there are use cases where Kudu will beat out Aerospike, and probably use 
>>>>> cases where Aerospike will beat Kudu as well.
>>>>> 
>>>>> From my perspective, it would be great if you can share some details of 
>>>>> your workload, especially if there are some areas you're finding Kudu 
>>>>> lacking. Maybe we can spot some easy code changes we could make to 
>>>>> improve performance, or suggest a tuning variable you could change.
>>>>> 
>>>>> -Todd
>>>>> 
>>>>> 
>>>>>> On May 27, 2016, at 9:19 PM, Todd Lipcon <t...@cloudera.com 
>>>>>> <mailto:t...@cloudera.com>> wrote:
>>>>>> 
>>>>>> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim <bbuil...@gmail.com 
>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>> Hi Mike,
>>>>>> 
>>>>>> First of all, thanks for the link. It looks like an interesting read. I 
>>>>>> checked that Aerospike is currently at version 3.8.2.3, and in the 
>>>>>> article, they are evaluating version 3.5.4. The main thing that 
>>>>>> impressed me was their claim that they can beat Cassandra and HBase by 
>>>>>> 8x for writing and 25x for reading. Their big claim to fame is that 
>>>>>> Aerospike can write 1M records per second with only 50 nodes. I wanted 
>>>>>> to see if this is real.
>>>>>> 
>>>>>> 1M records per second on 50 nodes is pretty doable by Kudu as well, 
>>>>>> depending on the size of your records and the insertion order. I've been 
>>>>>> playing with a ~70 node cluster recently and seen 1M+ writes/second 
>>>>>> sustained, and bursting above 4M. These are 1KB rows with 11 columns, 
>>>>>> and with pretty old HDD-only nodes. I think newer flash-based nodes 
>>>>>> could do better.
>>>>>>  
>>>>>> 
>>>>>> To answer your questions, we have a DMP with user profiles with many 
>>>>>> attributes. We create segmentation information off of these attributes 
>>>>>> to classify them. Then, we can target advertising appropriately for our 
>>>>>> sales department. Much of the data processing is for applying models on 
>>>>>> all or if not most of every profile’s attributes to find similarities 
>>>>>> (nearest neighbor/clustering) over a large number of rows when batch 
>>>>>> processing or a small subset of rows for quick online scoring. So, our 
>>>>>> use case is a typical advanced analytics scenario. We have tried HBase, 
>>>>>> but it doesn’t work well for these types of analytics.
>>>>>> 
>>>>>> I read, that Aerospike in the release notes, they did do many 
>>>>>> improvements for batch and scan operations.
>>>>>> 
>>>>>> I wonder what your thoughts are for using Kudu for this.
>>>>>> 
>>>>>> Sounds like a good Kudu use case to me. I've heard great things about 
>>>>>> Aerospike for the low latency random access portion, but I've also heard 
>>>>>> that it's _very_ expensive, and not particularly suited to the columnar 
>>>>>> scan workload. Lastly, I think the Apache license of Kudu is much more 
>>>>>> appealing than the AGPL3 used by Aerospike. But, that's not really a 
>>>>>> direct answer to the performance question :)
>>>>>>  
>>>>>> 
>>>>>> Thanks,
>>>>>> Ben
>>>>>> 
>>>>>> 
>>>>>>> On May 27, 2016, at 6:21 PM, Mike Percy <mpe...@cloudera.com 
>>>>>>> <mailto:mpe...@cloudera.com>> wrote:
>>>>>>> 
>>>>>>> Have you considered whether you have a scan heavy or a random access 
>>>>>>> heavy workload? Have you considered whether you always access / update 
>>>>>>> a whole row vs only a partial row? Kudu is a column store so has some 
>>>>>>> awesome performance characteristics when you are doing a lot of 
>>>>>>> scanning of just a couple of columns.
>>>>>>> 
>>>>>>> I don't know the answer to your question but if your concern is 
>>>>>>> performance then I would be interested in seeing comparisons from a 
>>>>>>> perf perspective on certain workloads.
>>>>>>> 
>>>>>>> Finally, a year ago Aerospike did quite poorly in a Jepsen test: 
>>>>>>> https://aphyr.com/posts/324-jepsen-aerospike 
>>>>>>> <https://aphyr.com/posts/324-jepsen-aerospike>
>>>>>>> 
>>>>>>> I wonder if they have addressed any of those issues.
>>>>>>> 
>>>>>>> Mike
>>>>>>> 
>>>>>>> On Friday, May 27, 2016, Benjamin Kim <bbuil...@gmail.com 
>>>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>>>> I am just curious. How will Kudu compare with Aerospike 
>>>>>>> (http://www.aerospike.com <http://www.aerospike.com/>)? I went to a 
>>>>>>> Spark Roadshow and found out about this piece of software. It appears 
>>>>>>> to fit our use case perfectly since we are an ad-tech company trying to 
>>>>>>> leverage our user profiles data. Plus, it already has a Spark connector 
>>>>>>> and has a SQL-like client. The tables can be accessed using Spark SQL 
>>>>>>> DataFrames and, also, made into SQL tables for direct use with Spark 
>>>>>>> SQL ODBC/JDBC Thriftserver. I see from the work done here 
>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/ 
>>>>>>> <http://gerrit.cloudera.org:8080/#/c/2992/> that the Spark integration 
>>>>>>> is well underway and, from the looks of it lately, almost complete. I 
>>>>>>> would prefer to use Kudu since we are already a Cloudera shop, and Kudu 
>>>>>>> is easy to deploy and configure using Cloudera Manager. I also hope 
>>>>>>> that some of Aerospike’s speed optimization techniques can make it into 
>>>>>>> Kudu in the future, if they have not been already thought of or 
>>>>>>> included.
>>>>>>> 
>>>>>>> Just some thoughts…
>>>>>>> 
>>>>>>> Cheers,
>>>>>>> Ben
>>>>>>> 
>>>>>>> 
>>>>>>> -- 
>>>>>>> --
>>>>>>> Mike Percy
>>>>>>> Software Engineer, Cloudera
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> -- 
>>>>>> Todd Lipcon
>>>>>> Software Engineer, Cloudera
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Todd Lipcon
>>>>> Software Engineer, Cloudera
>>>> 
>>> 
>> 
>> 
> 
> 
> 
> 
> -- 
> Todd Lipcon
> Software Engineer, Cloudera

Re: Performance Question

Reply via email to