When using range partitioning, you add partition splits via the CreateTableOptions.addSplitRow <http://getkudu.io/apidocs/org/kududb/client/CreateTableOptions.html#addSplitRow-org.kududb.client.PartialRow-> method. You can find more about the different partitioning options in the schema design guide <http://getkudu.io/docs/schema_design.html#data-distribution>. We generally recommend sticking to hash partitioning if possible, since you don't have to determine your own split rows.
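
For illustration, here is a minimal sketch of both approaches using the pre-1.0 Java client (org.kududb.client). The table names, columns, bucket count, and split values are hypothetical placeholders, and exact method signatures may differ slightly between releases:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.kududb.ColumnSchema;
import org.kududb.Schema;
import org.kududb.Type;
import org.kududb.client.CreateTableOptions;
import org.kududb.client.KuduClient;
import org.kududb.client.PartialRow;

public class CreateKuduTableSketch {
  public static void main(String[] args) throws Exception {
    KuduClient client =
        new KuduClient.KuduClientBuilder("kudu-master-host:7051").build();

    // Hypothetical two-column schema; a real ad-events table would be much wider.
    List<ColumnSchema> columns = new ArrayList<>();
    columns.add(new ColumnSchema.ColumnSchemaBuilder("user_id", Type.STRING).key(true).build());
    columns.add(new ColumnSchema.ColumnSchemaBuilder("event_type", Type.STRING).build());
    Schema schema = new Schema(columns);

    // Option 1 (recommended): hash partitioning. For ~50 nodes * 10 tablets per
    // node, ask for ~500 buckets and let Kudu pick the tablet boundaries.
    CreateTableOptions hashOpts = new CreateTableOptions()
        .addHashPartitions(Arrays.asList("user_id"), 500);

    // Option 2: range partitioning with explicit split rows via addSplitRow().
    // Each PartialRow marks a key value at which one tablet ends and the next begins.
    CreateTableOptions rangeOpts = new CreateTableOptions()
        .setRangePartitionColumns(Arrays.asList("user_id"));
    for (String split : new String[] {"g", "n", "t"}) {  // hypothetical split points
      PartialRow splitRow = schema.newPartialRow();
      splitRow.addString("user_id", split);
      rangeOpts.addSplitRow(splitRow);
    }

    client.createTable("ad_events_hash", schema, hashOpts);
    client.createTable("ad_events_range", schema, rangeOpts);
    client.shutdown();
  }
}

With roughly 500 hash buckets on a 50-node cluster you end up with about 10 tablets per node, which lines up with the guidance below about Spark parallelism being bounded by the number of tablets.
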
- Dan

On Wed, Jun 15, 2016 at 9:17 AM, Benjamin Kim <bbuil...@gmail.com> wrote:

> Todd,
>
> I think the locality issue is not within our setup. We have the compute cluster with Spark, YARN, etc. on its own, and we have the storage cluster with HBase, Kudu, etc. on another. We beefed up the hardware specs on the compute cluster and beefed up storage capacity on the storage cluster. We got this setup idea from the Databricks folks. I do have a question. I created the table to use range partitioning on columns. I see that if I use hash partitioning I can set the number of splits, but how do I do that using range (50 nodes * 10 = 500 splits)?
>
> Thanks,
> Ben
>
> On Jun 15, 2016, at 9:11 AM, Todd Lipcon <t...@cloudera.com> wrote:
>
> Awesome use case. One thing to keep in mind is that Spark parallelism will be limited by the number of tablets. So, you might want to split into 10 or so buckets per node to get the best query throughput.
>
> Usually if you run top on some machines while running the query you can see if it is fully utilizing the cores.
>
> Another known issue right now is that Spark locality isn't working properly on replicated tables, so you will use a lot of network traffic. For a perf test you might want to try a table with replication count 1.
>
> On Jun 15, 2016 5:26 PM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>
> Hi Todd,
>
> I did a simple test of our ad events. We stream using Spark Streaming directly into HBase, and the Data Analysts/Scientists do some insight/discovery work plus some report generation. For the reports we use SQL, and for the deeper stuff we use Spark. In Spark, our main data currency store of choice is DataFrames.
>
> The schema is around 83 columns wide, where most are of the string data type:
>
> "event_type", "timestamp", "event_valid", "event_subtype", "user_ip", "user_id", "mappable_id",
> "cookie_status", "profile_status", "user_status", "previous_timestamp", "user_agent", "referer",
> "host_domain", "uri", "request_elapsed", "browser_languages", "acamp_id", "creative_id",
> "location_id", "pcamp_id", "pdomain_id", "continent_code", "country", "region", "dma", "city", "zip",
> "isp", "line_speed", "gender", "year_of_birth", "behaviors_read", "behaviors_written",
> "key_value_pairs", "acamp_candidates", "tag_format", "optimizer_name", "optimizer_version",
> "optimizer_ip", "pixel_id", "video_id", "video_network_id", "video_time_watched",
> "video_percentage_watched", "video_media_type", "video_player_iframed", "video_player_in_view",
> "video_player_width", "video_player_height", "conversion_valid_sale", "conversion_sale_amount",
> "conversion_commission_amount", "conversion_step", "conversion_currency", "conversion_attribution",
> "conversion_offer_id", "custom_info", "frequency", "recency_seconds", "cost", "revenue",
> "optimizer_acamp_id", "optimizer_creative_id", "optimizer_ecpm", "impression_id", "diagnostic_data",
> "user_profile_mapping_source", "latitude", "longitude", "area_code", "gmt_offset", "in_dst",
> "proxy_type", "mobile_carrier", "pop", "hostname", "profile_expires", "timestamp_iso", "reference_id",
> "identity_organization", "identity_method"
>
> Most queries are things like counts of how many users use what browser, how many are unique users, etc. The part that scares most users is when it comes to joining this data with other dimension/3rd-party events tables because of the sheer size of it.
>
> We do what most companies do, similar to what I saw in earlier presentations of Kudu. We dump data out of HBase into partitioned Parquet tables to make query performance manageable.
>
> I will coordinate with a data scientist today to do some tests. He is working on identity matching/record linking of users from 2 domains, US and Singapore, using probabilistic deduping algorithms. I will load the ad events data from both countries and let him run his process against this data in Kudu. I hope this will "wow" the team.
>
> Thanks,
> Ben
>
> On Jun 15, 2016, at 12:47 AM, Todd Lipcon <t...@cloudera.com> wrote:
>
> Hi Benjamin,
>
> What workload are you using for benchmarks? Using Spark or something more custom? RDD or DataFrame or SQL, etc.? Maybe you can share the schema and some queries.
>
> Todd
>
> On Jun 15, 2016 8:10 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote:
>
>> Hi Todd,
>>
>> Now that Kudu 0.9.0 is out, I have done some tests. Already, I am impressed. Compared to HBase, read and write performance are better. Write performance has the greatest improvement (> 4x), while read is > 1.5x. Admittedly, these are only preliminary tests. Do you know of a way to really do some conclusive tests? I want to see if I can match your results on my 50-node cluster.
>>
>> Thanks,
>> Ben
>>
>> On May 30, 2016, at 10:33 AM, Todd Lipcon <t...@cloudera.com> wrote:
>>
>> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>>> Todd,
>>>
>>> It sounds like Kudu can possibly top or match those numbers put out by Aerospike. Do you have any performance statistics published, or any instructions on how to measure them myself as a good way to test? In addition, this will be a test using Spark, so should I wait for Kudu version 0.9.0, where support will be built in?
>>>
>>
>> We don't have a lot of benchmarks published yet, especially on the write side. I've found that thorough cross-system benchmarks are very difficult to do fairly and accurately, and oftentimes users end up misguided if they pay too much attention to them :) So, given a finite number of developers working on Kudu, I think we've tended to spend more time on the project itself and less time focusing on "competition". I'm sure there are use cases where Kudu will beat out Aerospike, and probably use cases where Aerospike will beat Kudu as well.
>>
>> From my perspective, it would be great if you can share some details of your workload, especially if there are some areas you're finding Kudu lacking. Maybe we can spot some easy code changes we could make to improve performance, or suggest a tuning variable you could change.
>>
>> -Todd
>>
>>> On May 27, 2016, at 9:19 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>>
>>> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>
>>>> Hi Mike,
>>>>
>>>> First of all, thanks for the link. It looks like an interesting read. I checked that Aerospike is currently at version 3.8.2.3, and in the article they are evaluating version 3.5.4. The main thing that impressed me was their claim that they can beat Cassandra and HBase by 8x for writing and 25x for reading. Their big claim to fame is that Aerospike can write 1M records per second with only 50 nodes. I wanted to see if this is real.
>>>>
>>>
>>> 1M records per second on 50 nodes is pretty doable by Kudu as well, depending on the size of your records and the insertion order. I've been playing with a ~70 node cluster recently and seen 1M+ writes/second sustained, and bursting above 4M. These are 1KB rows with 11 columns, on pretty old HDD-only nodes. I think newer flash-based nodes could do better.
>>>
>>>> To answer your questions, we have a DMP with user profiles with many attributes. We create segmentation information off of these attributes to classify them. Then, we can target advertising appropriately for our sales department. Much of the data processing is for applying models on all, or at least most, of every profile's attributes to find similarities (nearest neighbor/clustering) over a large number of rows when batch processing, or a small subset of rows for quick online scoring. So, our use case is a typical advanced analytics scenario. We have tried HBase, but it doesn't work well for these types of analytics.
>>>>
>>>> I read in the Aerospike release notes that they made many improvements for batch and scan operations.
>>>>
>>>> I wonder what your thoughts are on using Kudu for this.
>>>
>>> Sounds like a good Kudu use case to me. I've heard great things about Aerospike for the low-latency random access portion, but I've also heard that it's _very_ expensive, and not particularly suited to the columnar scan workload. Lastly, I think the Apache license of Kudu is much more appealing than the AGPL3 used by Aerospike. But, that's not really a direct answer to the performance question :)
>>>
>>>> Thanks,
>>>> Ben
>>>>
>>>> On May 27, 2016, at 6:21 PM, Mike Percy <mpe...@cloudera.com> wrote:
>>>>
>>>> Have you considered whether you have a scan-heavy or a random-access-heavy workload? Have you considered whether you always access/update a whole row vs. only a partial row? Kudu is a column store, so it has some awesome performance characteristics when you are doing a lot of scanning of just a couple of columns.
>>>>
>>>> I don't know the answer to your question, but if your concern is performance then I would be interested in seeing comparisons from a perf perspective on certain workloads.
>>>>
>>>> Finally, a year ago Aerospike did quite poorly in a Jepsen test: https://aphyr.com/posts/324-jepsen-aerospike
>>>>
>>>> I wonder if they have addressed any of those issues.
>>>>
>>>> Mike
>>>>
>>>> On Friday, May 27, 2016, Benjamin Kim <bbuil...@gmail.com> wrote:
>>>>
>>>>> I am just curious. How will Kudu compare with Aerospike (http://www.aerospike.com)? I went to a Spark Roadshow and found out about this piece of software. It appears to fit our use case perfectly, since we are an ad-tech company trying to leverage our user profiles data. Plus, it already has a Spark connector and has a SQL-like client. The tables can be accessed using Spark SQL DataFrames and, also, made into SQL tables for direct use with the Spark SQL ODBC/JDBC Thriftserver. I see from the work done here http://gerrit.cloudera.org:8080/#/c/2992/ that the Spark integration is well underway and, from the looks of it lately, almost complete. I would prefer to use Kudu since we are already a Cloudera shop, and Kudu is easy to deploy and configure using Cloudera Manager. I also hope that some of Aerospike's speed optimization techniques can make it into Kudu in the future, if they have not been already thought of or included.
>>>>>
>>>>> Just some thoughts…
>>>>>
>>>>> Cheers,
>>>>> Ben
>>>>
>>>> --
>>>> Mike Percy
>>>> Software Engineer, Cloudera
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera