We are about to prototype upgrading our batch inserts, so I’m really glad about
this thread… we are able to saturate our dedicated network links from Hadoop
when inserting via the Thrift API (Astyanax); at the time we wrote that code, CQL
wasn’t there.
Reasons to replace our current solution:
1) W
Hi Eric, Ryan,
Thanks a lot for your insights. I got more than I hoped for in this
discussion.
I'll further improve our code to include the replica-awareness and will
compare that to the previous tests.
That snippet of code is really helpful. Thanks.
I have not been in the list long enough to ha
Yep, my approach is definitely naive to hotspotting. If someone had that
trouble, they could exhaust the iterator out of getReplicas() and
distribute their writes more evenly (which might result in better statement
distribution, but wouldn't change the workload on the cluster). In the end
they're
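
A minimal sketch of the two grouping strategies mentioned above, written without any driver dependency. The `replicasFor` function is a hypothetical stand-in for a replica lookup such as getReplicas(), and the host names are made up. `byFirstReplica` is the naive approach; `byLeastLoadedReplica` walks every replica of a key and picks the one with the fewest keys assigned so far. Both assume each key has at least one replica.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class ReplicaGrouping {

    // Naive grouping: every key goes to the first replica returned for it.
    public static Map<String, List<String>> byFirstReplica(
            List<String> keys, Function<String, List<String>> replicasFor) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String key : keys) {
            String host = replicasFor.apply(key).get(0);
            groups.computeIfAbsent(host, h -> new ArrayList<>()).add(key);
        }
        return groups;
    }

    // Hotspot-aware grouping: look at all replicas of a key and pick the one
    // that currently holds the fewest keys (ties go to the earlier replica).
    public static Map<String, List<String>> byLeastLoadedReplica(
            List<String> keys, Function<String, List<String>> replicasFor) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String key : keys) {
            String best = null;
            int bestSize = Integer.MAX_VALUE;
            for (String host : replicasFor.apply(key)) {
                List<String> existing = groups.get(host);
                int size = existing == null ? 0 : existing.size();
                if (size < bestSize) {
                    best = host;
                    bestSize = size;
                }
            }
            groups.computeIfAbsent(best, h -> new ArrayList<>()).add(key);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical topology: every key lives on the same three hosts.
        Function<String, List<String>> replicasFor =
                k -> Arrays.asList("host-a", "host-b", "host-c");
        List<String> keys = Arrays.asList("k1", "k2", "k3", "k4", "k5", "k6");
        System.out.println(byFirstReplica(keys, replicasFor));
        System.out.println(byLeastLoadedReplica(keys, replicasFor));
    }
}
```

Each resulting group would become one unlogged batch sent to that host. As noted above, spreading the groups changes how statements are distributed across coordinators, not the total work the cluster has to do.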
I think my main point is still: unlogged token-aware batches are great, but if
your writes are large enough, they may actually hurt rather than help, and
likewise if your writes are too small, async only is likely only going to hurt.
I’d say the average user I’ve had to help (with my selectio
> compaction usually is the limiter for most clusters, so the difference
between async versus unlogged batch ends up being minor or worse..non
existent cause the hardware and data model combination result in compaction
being the main throttle.
If your number of records to load per second is predet
Generally this is all correct, but I cannot emphasize enough how much this “just
depends”, and today I generally move people to async inserts first before trying
to micro-optimize. Some things to keep in mind:
compaction usually is the limiter for most clusters, so the difference between
async ver
> I side-tracked some punctual benchmarks and stumbled on the observations
of unlogged inserts being *A LOT* faster than the async counterparts.
My own testing agrees very strongly with this. When this topic came up on
this list before, there was a concern that batch coordination produces GC
pres
General advice advocates for individual async inserts as the fastest way to
insert data into Cassandra. Our insertion mechanism is based on that model
and recently we have been evaluating performance, looking to measure and
optimize our ingestion rate.
I side-tracked some punctual benchmarks and s
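
For reference, a minimal sketch of the "individual async inserts" model with a cap on in-flight requests. It does not use a real driver: `insertAsync` is a hypothetical stand-in for whatever asynchronous execute call the client exposes, and the `maxInFlight` value is an arbitrary assumption, not a recommendation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class ThrottledAsyncInserts {

    // Issues one async insert per row while keeping at most maxInFlight
    // inserts unfinished at any time. Returns how many inserts succeeded.
    public static <T> int insertAll(List<T> rows, int maxInFlight,
            Function<T, CompletableFuture<Void>> insertAsync) {
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger succeeded = new AtomicInteger();
        for (T row : rows) {
            // Blocks the producer while maxInFlight inserts are still pending.
            permits.acquireUninterruptibly();
            insertAsync.apply(row).whenComplete((ok, err) -> {
                if (err == null) {
                    succeeded.incrementAndGet();
                }
                permits.release();
            });
        }
        // Wait for the remaining in-flight inserts to complete.
        permits.acquireUninterruptibly(maxInFlight);
        return succeeded.get();
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add(i);
        }
        // Fake insert that completes on a background thread.
        int ok = insertAll(rows, 64, row -> CompletableFuture.runAsync(() -> { }));
        System.out.println("succeeded: " + ok);
    }
}
```

The semaphore is what keeps a fast producer from queueing unbounded requests on the client; failed inserts are only counted out here, so retry handling would need to be added.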