Hi Mark / Dennis,

Can you provide the snippet of the code that puts a 5k record onto Riak as a map?

Chris

On Tue, Oct 20, 2015 at 11:30 AM Mark Schmidt <mschm...@orcawave.net> wrote:

> Hi folks, sorry for the confusion.
>
> Our scenario is as follows:
>
> We have a 6 node development cluster running on its own network segment
> using HAProxy to facilitate load-balancing across the nodes. A single
> Riak-dot-NET client service is performing the insert operations from
> dedicated hardware located within the same network segment. We have basic
> network throughput capabilities of 100 Mbit with an average speed
> achievable of 75 Mbit.
>
> The data we are attempting to insert is composed of phone call record
> receipts from telephone carriers. These records are batched and written to
> a flat file for incorporation into our reporting engine.
>
> 1) Our Riak client process takes a flat file (in this case, a 40MB
> collection of records, each record being approximately 5k in size) and
> parses the entire file so each record can be added to a local .NET queue.
>
> 2) Once the entire file has been parsed and each record loaded into the
> local queue, 20 threads are spawned and connections are opened to our Riak
> nodes via the HAProxy.
>
> 3) Each thread will pull a 5k record from the queue on a first come first
> served basis and perform a put to the Riak environment.
>
> When first testing our client insert process, we were pushing the 5K
> records as whole strings into the Riak environment. Network throughput
> topped out at around 80 Mbits with a total load time of 90 seconds for 149k
> records. When the client process was modified (same queuing and de-queuing
> methods) so that a map datatype bucket would be created and keys stored as
> registers, we saw network throughput drop to around 10 Mbit with total
> upload time increase to around 270 seconds for the 149k records.
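
As a rough sanity check on the figures quoted just above, the two runs can be compared with a little arithmetic. This is only a sketch under stated assumptions: that "Mbit" means 10^6 bits per second, that the quoted throughput was sustained for the whole load (the message says "topped out", so it may well be a peak), and that both runs moved the same 149k records. The function and variable names are made up for illustration.

```python
RECORDS = 149_000  # record count quoted for both runs


def summarize(seconds, link_mbit):
    """Return (records per second, average bytes on the wire per record).

    Assumes link_mbit is an average in units of 10^6 bits per second
    held for the full duration, which the original message does not state.
    """
    records_per_sec = RECORDS / seconds
    total_bytes = link_mbit * 1_000_000 / 8 * seconds
    return records_per_sec, total_bytes / RECORDS


# Whole-string puts: about 80 Mbit for 90 seconds.
plain_rps, plain_bytes = summarize(seconds=90, link_mbit=80)
# Map with registers: about 10 Mbit for 270 seconds.
map_rps, map_bytes = summarize(seconds=270, link_mbit=10)

print(f"plain strings: {plain_rps:.0f} records/s, ~{plain_bytes:.0f} bytes/record")
print(f"map registers: {map_rps:.0f} records/s, ~{map_bytes:.0f} bytes/record")
print(f"slowdown: {plain_rps / map_rps:.1f}x")
```

Taken at face value, the map run is about 3x slower while using a much smaller share of the link, which would be consistent with a bottleneck somewhere other than network throughput. The inputs are approximate, so treat this as a pointer for where to measure next rather than a conclusion.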
>
> It appears as though we’ve either encountered a potential bottleneck
> unrelated to network throughput, or we’re just seeing an expected
> processing penalty for our use of Riak datatypes. Please note, we’re
> configuring Zabbix so we can monitor disk IO on each node as processor and
> memory resources don’t appear to be the culprit either.
>
> If the reduction in processing speed is a natural consequence of utilizing
> Riak data types, is the inter-node network the optimum place to increase
> resources? Our eventual datacenter implementation will support speeds of
> over 40 Gbit for inter-node communication. We’re just trying to identify
> which levers from an operational standpoint we can throw to boost
> performance, or if our client implementation is suspect.
>
> You bring up some excellent points regarding our use of CRDTs. In our
> case, the call data records are mutable as they are subject to changes by
> phone carriers for billing error corrections, incorrect data and a host of
> other reasons. We may be better served by treating the records as immutable
> and performing wide scale record removal and “reprocessing” in the event
> changes to existing records are received/requested.
>
> Thank you,
>
> Mark Schmidt
>
> *From:* Alexander Sicular [mailto:sicul...@gmail.com]
> *Sent:* Tuesday, October 20, 2015 10:55 AM
> *To:* Dennis Nicolay <dnico...@orcawave.net>
> *Cc:* Christopher Mancini <cmanc...@basho.com>; riak-users@lists.basho.com;
> Mark Schmidt <mschm...@orcawave.net>
> *Subject:* Re: Using Bucket Data Types slowed insert performance
>
> Let's talk about Riak data types for a moment. Riak data types are
> collectively implementations of what academia refers to as CRDTs
> (convergent or conflict free replicated data types.) The key benefit a CRDT
> offers, compared with a traditional KV, is automatic conflict
> resolution. The various CRDTs provided in Riak have specific conflict
> resolution strategies.
> This does not come for free. There is a
> computational cost associated with CRDTs. If your use case requires
> automated conflict resolution strategies then CRDTs are a good fit.
> Internally CRDTs rely on vector clocks (see DVVs in the documentation) to
> resolve conflict.
>
> Considering your ETL use case I'm going to presume that your data is
> immutable (I could very well be wrong here.) If your data is immutable I
> would consider simply using a KV and not paying the CRDT computational
> penalty (and possibly even the write once bucket.) The CRDT penalty you pay
> is obviously subject to your use case, configuration, hw deployment etc.
>
> Hope that helps!
> -Alexander
>
> @siculars
> http://siculars.posthaven.com
>
> Sent from my iRotaryPhone
>
> On Oct 20, 2015, at 12:39, Dennis Nicolay <dnico...@orcawave.net> wrote:
>
> Hi Alexander,
>
> I’m parsing the file and storing each row with its own key in a map datatype
> bucket and each column is a register.
>
> Thanks,
> Dennis
>
> *From:* Alexander Sicular [mailto:sicul...@gmail.com]
> *Sent:* Tuesday, October 20, 2015 10:34 AM
> *To:* Dennis Nicolay
> *Cc:* Christopher Mancini; riak-users@lists.basho.com
> *Subject:* Re: Using Bucket Data Types slowed insert performance
>
> Hi Dennis,
>
> It's a bit unclear what you are trying to do here. Are you 1. uploading
> the entire file and saving it to one key with the value being the file? Or
> are you 2. parsing the file and storing each row as a register in a map?
>
> Neither of those approaches is appropriate in Riak KV. For the first
> case I would point you to Riak S2 which is designed to manage large binary
> object storage. You can keep the large file as a single addressable entity
> and access it via Amazon S3 or Swift protocol. For the second case I would
> consider maintaining one key (map) per row in the file and have a register
> per column in the row.
> Or do not use Riak data types (maps, sets, registers,
> flags and counters) and simply keep each row in the file as a KV in Riak
> either as a raw string or as a serialized JSON string. ETL'ing out of
> relational databases and into Riak is a very common use case and often
> implemented in the fashion I described.
>
> As Chris mentioned, the soft upper bound on value size should be 1MB. I say
> soft because we won't enforce it although there are settings in the config
> that can be changed to enforce it (default 5MB warning, 50MB reject I
> believe.)
>
> Best,
> Alexander
>
> @siculars
> http://siculars.posthaven.com
>
> Sent from my iRotaryPhone
>
> On Oct 20, 2015, at 10:22, Christopher Mancini <cmanc...@basho.com> wrote:
>
> Hi Dennis,
>
> I am not the most experienced, but what I do know is that a file that size
> causes a great deal of network chatter because it has to hand off that data
> to the other nodes in the network and will cause delays in Riak's ability
> to send and confirm consistency across the ring. Typically we recommend
> that you try to structure your objects to around 1MB or less to ensure
> consistent performance. That max object size can vary of course based on
> your network / server specs and configuration.
>
> I hope this helps.
>
> Chris
>
> On Tue, Oct 20, 2015 at 8:18 AM Dennis Nicolay <dnico...@orcawave.net>
> wrote:
>
> Hi,
>
> I’m using .NET RiakClient 2.0 to insert a 44MB delimited file with 139k
> rows of data into Riak. I switched to a map bucket data type with
> registers. It is taking about 3 times longer to insert into this bucket
> vs a non data typed bucket. Any suggestions?
>
> Thanks in advance,
>
> Dennis
>
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com