Todd, FYI. The key is unique for every row so rows are not going to already exist. Basically, everything is an INSERT.
val generateUUID = udf(() => UUID.randomUUID().toString) As you can see, we are using UUID java library to create the key. Cheers, Ben > On Jun 29, 2016, at 1:32 PM, Todd Lipcon <t...@cloudera.com> wrote: > > On Wed, Jun 29, 2016 at 11:32 AM, Benjamin Kim <bbuil...@gmail.com > <mailto:bbuil...@gmail.com>> wrote: > Todd, > > I started Spark streaming more events into Kudu. Performance is great there > too! With HBase, it’s fast too, but I noticed that it pauses here and there, > making it take seconds for > 40k rows at a time, while Kudu doesn’t. The > progress bar just blinks by. I will keep this running until it hits 1B rows > and rerun my performance tests. This, hopefully, will give better numbers. > > Cool! We have invested a lot of work in making Kudu have consistent > performance, like you mentioned. It's generally been my experience that most > mature ops people would prefer a system which consistently performs well > rather than one which has higher peak performance but occasionally stalls. > > BTW, what is your row key design? One exception to the above is that, if > you're doing random inserts, you may see performance "fall off a cliff" once > the size of your key columns becomes larger than the aggregate memory size of > your cluster, if you're running on hard disks. Our inserts require checks for > duplicate keys, and that can cause random disk IOs if your keys don't fit > comfortably in cache. This is one area that HBase is fundamentally going to > be faster based on its design. > > -Todd > > >> On Jun 28, 2016, at 4:26 PM, Todd Lipcon <t...@cloudera.com >> <mailto:t...@cloudera.com>> wrote: >> >> Cool, thanks for the report, Ben. For what it's worth, I think there's still >> some low hanging fruit in the Spark connector for Kudu (for example, I >> believe locality on reads is currently broken). So, you can expect >> performance to continue to improve in future versions. I'd also be >> interested to see results on Kudu for a much larger dataset - my guess is a >> lot of the 6 seconds you're seeing is constant overhead from Spark job >> setup, etc, given that the performance doesn't seem to get slower as you >> went from 700K rows to 13M rows. >> >> -Todd >> >> On Tue, Jun 28, 2016 at 3:03 PM, Benjamin Kim <bbuil...@gmail.com >> <mailto:bbuil...@gmail.com>> wrote: >> FYI. >> >> I did a quick-n-dirty performance test. >> >> First, the setup: >> QA cluster: >> 15 data nodes >> 64GB memory each >> HBase is using 4GB of memory >> Kudu is using 1GB of memory >> 1 HBase/Kudu master node >> 64GB memory >> HBase/Kudu master is using 1GB of memory each >> 10Gb Ethernet >> >> Using Spark on both to load/read events data (84 columns per row), I was >> able to record performance for each. On the HBase side, I used the Phoenix >> 4.7 Spark plugin where DataFrames can be used directly. On the Kudu side, I >> used the Spark connector. I created an events table in Phoenix using the >> CREATE TABLE statement and created the equivalent in Kudu using the Spark >> method based off of a DataFrame schema. >> >> Here are the numbers for Phoenix/HBase. >> 1st run: >> > 715k rows >> - write: 2.7m >> >> > 715k rows in HBase table >> - read: 0.1s >> - count: 3.8s >> - aggregate: 61s >> >> 2nd run: >> > 5.2M rows >> - write: 11m >> * had 4 region servers go down, had to retry the 5.2M row write >> >> > 5.9M rows in HBase table >> - read: 8s >> - count: 3m >> - aggregate: 46s >> >> 3rd run: >> > 6.8M rows >> - write: 9.6m >> >> > 12.7M rows >> - read: 10s >> - count: 3m >> - aggregate: 44s >> >> >> Here are the numbers for Kudu. >> 1st run: >> > 715k rows >> - write: 18s >> >> > 715k rows in Kudu table >> - read: 0.2s >> - count: 18s >> - aggregate: 5s >> >> 2nd run: >> > 5.2M rows >> - write: 33s >> >> > 5.9M rows in Kudu table >> - read: 0.2s >> - count: 16s >> - aggregate: 6s >> >> 3rd run: >> > 6.8M rows >> - write: 27s >> >> > 12.7M rows in Kudu table >> - read: 0.2s >> - count: 16s >> - aggregate: 6s >> >> The Kudu results are impressive if you take these number as-is. Kudu is >> close to 18x faster at writing (UPSERT). Kudu is 30x faster at reading >> (HBase times increase as data size grows). Kudu is 7x faster at full row >> counts. Lastly, Kudu is 3x faster doing an aggregate query (count distinct >> event_id’s per user_id). *Remember that this is small cluster, times are >> still respectable for both systems, HBase could have been configured better, >> and the HBase table could have been better tuned. >> >> Cheers, >> Ben >> >> >>> On Jun 15, 2016, at 10:13 AM, Dan Burkert <d...@cloudera.com >>> <mailto:d...@cloudera.com>> wrote: >>> >>> Adding partition splits when range partitioning is done via the >>> CreateTableOptions.addSplitRow >>> <http://getkudu.io/apidocs/org/kududb/client/CreateTableOptions.html#addSplitRow-org.kududb.client.PartialRow-> >>> method. You can find more about the different partitioning options in the >>> schema design guide >>> <http://getkudu.io/docs/schema_design.html#data-distribution>. We >>> generally recommend sticking to hash partitioning if possible, since you >>> don't have to determine your own split rows. >>> >>> - Dan >>> >>> On Wed, Jun 15, 2016 at 9:17 AM, Benjamin Kim <bbuil...@gmail.com >>> <mailto:bbuil...@gmail.com>> wrote: >>> Todd, >>> >>> I think the locality is not within our setup. We have the compute cluster >>> with Spark, YARN, etc. on its own, and we have the storage cluster with >>> HBase, Kudu, etc. on another. We beefed up the hardware specs on the >>> compute cluster and beefed up storage capacity on the storage cluster. We >>> got this setup idea from the Databricks folks. I do have a question. I >>> created the table to use range partition on columns. I see that if I use >>> hash partition I can set the number of splits, but how do I do that using >>> range (50 nodes * 10 = 500 splits)? >>> >>> Thanks, >>> Ben >>> >>> >>>> On Jun 15, 2016, at 9:11 AM, Todd Lipcon <t...@cloudera.com >>>> <mailto:t...@cloudera.com>> wrote: >>>> >>>> Awesome use case. One thing to keep in mind is that spark parallelism will >>>> be limited by the number of tablets. So, you might want to split into 10 >>>> or so buckets per node to get the best query throughput. >>>> >>>> Usually if you run top on some machines while running the query you can >>>> see if it is fully utilizing the cores. >>>> >>>> Another known issue right now is that spark locality isn't working >>>> properly on replicated tables so you will use a lot of network traffic. >>>> For a perf test you might want to try a table with replication count 1 >>>> >>>> On Jun 15, 2016 5:26 PM, "Benjamin Kim" <bbuil...@gmail.com >>>> <mailto:bbuil...@gmail.com>> wrote: >>>> Hi Todd, >>>> >>>> I did a simple test of our ad events. We stream using Spark Streaming >>>> directly into HBase, and the Data Analysts/Scientists do some >>>> insight/discovery work plus some reports generation. For the reports, we >>>> use SQL, and the more deeper stuff, we use Spark. In Spark, our main data >>>> currency store of choice is DataFrames. >>>> >>>> The schema is around 83 columns wide where most are of the string data >>>> type. >>>> >>>> "event_type", "timestamp", "event_valid", "event_subtype", "user_ip", >>>> "user_id", "mappable_id", >>>> "cookie_status", "profile_status", "user_status", "previous_timestamp", >>>> "user_agent", "referer", >>>> "host_domain", "uri", "request_elapsed", "browser_languages", "acamp_id", >>>> "creative_id", >>>> "location_id", “pcamp_id", >>>> "pdomain_id", "continent_code", "country", "region", "dma", "city", "zip", >>>> "isp", "line_speed", >>>> "gender", "year_of_birth", "behaviors_read", "behaviors_written", >>>> "key_value_pairs", "acamp_candidates", >>>> "tag_format", "optimizer_name", "optimizer_version", "optimizer_ip", >>>> "pixel_id", “video_id", >>>> "video_network_id", "video_time_watched", "video_percentage_watched", >>>> "video_media_type", >>>> "video_player_iframed", "video_player_in_view", "video_player_width", >>>> "video_player_height", >>>> "conversion_valid_sale", "conversion_sale_amount", >>>> "conversion_commission_amount", "conversion_step", >>>> "conversion_currency", "conversion_attribution", "conversion_offer_id", >>>> "custom_info", "frequency", >>>> "recency_seconds", "cost", "revenue", “optimizer_acamp_id", >>>> "optimizer_creative_id", "optimizer_ecpm", "impression_id", >>>> "diagnostic_data", >>>> "user_profile_mapping_source", "latitude", "longitude", "area_code", >>>> "gmt_offset", "in_dst", >>>> "proxy_type", "mobile_carrier", "pop", "hostname", "profile_expires", >>>> "timestamp_iso", "reference_id", >>>> "identity_organization", "identity_method" >>>> >>>> Most queries are like counts of how many users use what browser, how many >>>> are unique users, etc. The part that scares most users is when it comes to >>>> joining this data with other dimension/3rd party events tables because of >>>> shear size of it. >>>> >>>> We do what most companies do, similar to what I saw in earlier >>>> presentations of Kudu. We dump data out of HBase into partitioned Parquet >>>> tables to make query performance manageable. >>>> >>>> I will coordinate with a data scientist today to do some tests. He is >>>> working on identity matching/record linking of users from 2 domains: US >>>> and Singapore, using probabilistic deduping algorithms. I will load the >>>> data from ad events from both countries, and let him run his process >>>> against this data in Kudu. I hope this will “wow” the team. >>>> >>>> Thanks, >>>> Ben >>>> >>>>> On Jun 15, 2016, at 12:47 AM, Todd Lipcon <t...@cloudera.com >>>>> <mailto:t...@cloudera.com>> wrote: >>>>> >>>>> Hi Benjamin, >>>>> >>>>> What workload are you using for benchmarks? Using spark or something more >>>>> custom? rdd or data frame or SQL, etc? Maybe you can share the schema and >>>>> some queries >>>>> >>>>> Todd >>>>> >>>>> Todd >>>>> >>>>> On Jun 15, 2016 8:10 AM, "Benjamin Kim" <bbuil...@gmail.com >>>>> <mailto:bbuil...@gmail.com>> wrote: >>>>> Hi Todd, >>>>> >>>>> Now that Kudu 0.9.0 is out. I have done some tests. Already, I am >>>>> impressed. Compared to HBase, read and write performance are better. >>>>> Write performance has the greatest improvement (> 4x), while read is > >>>>> 1.5x. Albeit, these are only preliminary tests. Do you know of a way to >>>>> really do some conclusive tests? I want to see if I can match your >>>>> results on my 50 node cluster. >>>>> >>>>> Thanks, >>>>> Ben >>>>> >>>>>> On May 30, 2016, at 10:33 AM, Todd Lipcon <t...@cloudera.com >>>>>> <mailto:t...@cloudera.com>> wrote: >>>>>> >>>>>> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim <bbuil...@gmail.com >>>>>> <mailto:bbuil...@gmail.com>> wrote: >>>>>> Todd, >>>>>> >>>>>> It sounds like Kudu can possibly top or match those numbers put out by >>>>>> Aerospike. Do you have any performance statistics published or any >>>>>> instructions as to measure them myself as good way to test? In addition, >>>>>> this will be a test using Spark, so should I wait for Kudu version 0.9.0 >>>>>> where support will be built in? >>>>>> >>>>>> We don't have a lot of benchmarks published yet, especially on the write >>>>>> side. I've found that thorough cross-system benchmarks are very >>>>>> difficult to do fairly and accurately, and often times users end up >>>>>> misguided if they pay too much attention to them :) So, given a finite >>>>>> number of developers working on Kudu, I think we've tended to spend more >>>>>> time on the project itself and less time focusing on "competition". I'm >>>>>> sure there are use cases where Kudu will beat out Aerospike, and >>>>>> probably use cases where Aerospike will beat Kudu as well. >>>>>> >>>>>> From my perspective, it would be great if you can share some details of >>>>>> your workload, especially if there are some areas you're finding Kudu >>>>>> lacking. Maybe we can spot some easy code changes we could make to >>>>>> improve performance, or suggest a tuning variable you could change. >>>>>> >>>>>> -Todd >>>>>> >>>>>> >>>>>>> On May 27, 2016, at 9:19 PM, Todd Lipcon <t...@cloudera.com >>>>>>> <mailto:t...@cloudera.com>> wrote: >>>>>>> >>>>>>> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim <bbuil...@gmail.com >>>>>>> <mailto:bbuil...@gmail.com>> wrote: >>>>>>> Hi Mike, >>>>>>> >>>>>>> First of all, thanks for the link. It looks like an interesting read. I >>>>>>> checked that Aerospike is currently at version 3.8.2.3, and in the >>>>>>> article, they are evaluating version 3.5.4. The main thing that >>>>>>> impressed me was their claim that they can beat Cassandra and HBase by >>>>>>> 8x for writing and 25x for reading. Their big claim to fame is that >>>>>>> Aerospike can write 1M records per second with only 50 nodes. I wanted >>>>>>> to see if this is real. >>>>>>> >>>>>>> 1M records per second on 50 nodes is pretty doable by Kudu as well, >>>>>>> depending on the size of your records and the insertion order. I've >>>>>>> been playing with a ~70 node cluster recently and seen 1M+ >>>>>>> writes/second sustained, and bursting above 4M. These are 1KB rows with >>>>>>> 11 columns, and with pretty old HDD-only nodes. I think newer >>>>>>> flash-based nodes could do better. >>>>>>> >>>>>>> >>>>>>> To answer your questions, we have a DMP with user profiles with many >>>>>>> attributes. We create segmentation information off of these attributes >>>>>>> to classify them. Then, we can target advertising appropriately for our >>>>>>> sales department. Much of the data processing is for applying models on >>>>>>> all or if not most of every profile’s attributes to find similarities >>>>>>> (nearest neighbor/clustering) over a large number of rows when batch >>>>>>> processing or a small subset of rows for quick online scoring. So, our >>>>>>> use case is a typical advanced analytics scenario. We have tried HBase, >>>>>>> but it doesn’t work well for these types of analytics. >>>>>>> >>>>>>> I read, that Aerospike in the release notes, they did do many >>>>>>> improvements for batch and scan operations. >>>>>>> >>>>>>> I wonder what your thoughts are for using Kudu for this. >>>>>>> >>>>>>> Sounds like a good Kudu use case to me. I've heard great things about >>>>>>> Aerospike for the low latency random access portion, but I've also >>>>>>> heard that it's _very_ expensive, and not particularly suited to the >>>>>>> columnar scan workload. Lastly, I think the Apache license of Kudu is >>>>>>> much more appealing than the AGPL3 used by Aerospike. But, that's not >>>>>>> really a direct answer to the performance question :) >>>>>>> >>>>>>> >>>>>>> Thanks, >>>>>>> Ben >>>>>>> >>>>>>> >>>>>>>> On May 27, 2016, at 6:21 PM, Mike Percy <mpe...@cloudera.com >>>>>>>> <mailto:mpe...@cloudera.com>> wrote: >>>>>>>> >>>>>>>> Have you considered whether you have a scan heavy or a random access >>>>>>>> heavy workload? Have you considered whether you always access / update >>>>>>>> a whole row vs only a partial row? Kudu is a column store so has some >>>>>>>> awesome performance characteristics when you are doing a lot of >>>>>>>> scanning of just a couple of columns. >>>>>>>> >>>>>>>> I don't know the answer to your question but if your concern is >>>>>>>> performance then I would be interested in seeing comparisons from a >>>>>>>> perf perspective on certain workloads. >>>>>>>> >>>>>>>> Finally, a year ago Aerospike did quite poorly in a Jepsen test: >>>>>>>> https://aphyr.com/posts/324-jepsen-aerospike >>>>>>>> <https://aphyr.com/posts/324-jepsen-aerospike> >>>>>>>> >>>>>>>> I wonder if they have addressed any of those issues. >>>>>>>> >>>>>>>> Mike >>>>>>>> >>>>>>>> On Friday, May 27, 2016, Benjamin Kim <bbuil...@gmail.com >>>>>>>> <mailto:bbuil...@gmail.com>> wrote: >>>>>>>> I am just curious. How will Kudu compare with Aerospike >>>>>>>> (http://www.aerospike.com <http://www.aerospike.com/>)? I went to a >>>>>>>> Spark Roadshow and found out about this piece of software. It appears >>>>>>>> to fit our use case perfectly since we are an ad-tech company trying >>>>>>>> to leverage our user profiles data. Plus, it already has a Spark >>>>>>>> connector and has a SQL-like client. The tables can be accessed using >>>>>>>> Spark SQL DataFrames and, also, made into SQL tables for direct use >>>>>>>> with Spark SQL ODBC/JDBC Thriftserver. I see from the work done here >>>>>>>> http://gerrit.cloudera.org:8080/#/c/2992/ >>>>>>>> <http://gerrit.cloudera.org:8080/#/c/2992/> that the Spark integration >>>>>>>> is well underway and, from the looks of it lately, almost complete. I >>>>>>>> would prefer to use Kudu since we are already a Cloudera shop, and >>>>>>>> Kudu is easy to deploy and configure using Cloudera Manager. I also >>>>>>>> hope that some of Aerospike’s speed optimization techniques can make >>>>>>>> it into Kudu in the future, if they have not been already thought of >>>>>>>> or included. >>>>>>>> >>>>>>>> Just some thoughts… >>>>>>>> >>>>>>>> Cheers, >>>>>>>> Ben >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> -- >>>>>>>> Mike Percy >>>>>>>> Software Engineer, Cloudera >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Todd Lipcon >>>>>>> Software Engineer, Cloudera >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Todd Lipcon >>>>>> Software Engineer, Cloudera >>>>> >>>> >>> >>> >> >> >> >> >> -- >> Todd Lipcon >> Software Engineer, Cloudera > > > > > -- > Todd Lipcon > Software Engineer, Cloudera