Yes, thanks. I will move this to the user mailing list. Sorry for the inconvenience.
- Dong

> On Dec 2, 2014, at 9:33 AM, Aleksey Yeschenko <alek...@apache.org> wrote:
>
> Guys, please move this discussion to the users mailing list. This one is for
> Cassandra committers and other contributors, to discuss development of
> Cassandra itself.
>
> --
> AY
>
>> On Dec 2, 2014, at 16:17, Ryan Svihla <rsvi...@datastax.com> wrote:
>>
>> Misspoke:
>>
>> "That's all correct but what you're not accounting for is if you use a
>> token aware client then the coordinator will likely not own all the data in
>> a batch"
>>
>> should just be
>>
>> "That's all correct but what you're not accounting for is the coordinator
>> will likely not own all the data in a batch"
>>
>> Token awareness has no effect on that fact.
>>
>>> On Tue, Dec 2, 2014 at 9:13 AM, Ryan Svihla <rsvi...@datastax.com> wrote:
>>>
>>>> On Mon, Dec 1, 2014 at 1:52 PM, Dong Dai <daidon...@gmail.com> wrote:
>>>>
>>>> Thanks, Ryan, and also thanks for your great blog post.
>>>>
>>>> However, this makes me more confused, mainly about the coordinators.
>>>>
>>>> Based on my understanding, whether it is a batch insertion, an ordinary
>>>> sync insert, or an async insert, the coordinator is only selected once
>>>> for the whole session by calling cluster.connect(), and after that, all
>>>> the insertions go through that coordinator.
>>>
>>> That's all correct but what you're not accounting for is if you use a
>>> token aware client then the coordinator will likely not own all the data
>>> in a batch, ESPECIALLY as you scale up to more nodes. If you are using
>>> executeAsync and a single row, then the coordinator node will always be
>>> an owner of the data, thereby minimizing network hops. Some people stop
>>> me here and say "but the client is making those hops!", and that's when I
>>> point out "what do you think the coordinator has to do": only you've
>>> introduced something in the middle and prevented token awareness from
>>> doing its job.
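The single-row, token-aware pattern Ryan describes can be sketched with the 2014-era DataStax Java Driver (2.x). This is a minimal illustration only, not runnable without a live cluster; the contact point, keyspace, table, and column names are made up:

```java
// Assumes the com.datastax.driver.core classes and policies are imported.
Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1")
        // Route each statement to a replica that owns its partition key.
        .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
        .build();
Session session = cluster.connect("demo_ks");

// Prepared statements carry routing information, so the driver can pick
// an owning replica as the coordinator for every single-row insert.
PreparedStatement insert =
        session.prepare("INSERT INTO events (id, payload) VALUES (?, ?)");

List<ResultSetFuture> futures = new ArrayList<>();
for (int i = 0; i < 1000; i++) {
    futures.add(session.executeAsync(insert.bind(UUID.randomUUID(), "row-" + i)));
}
for (ResultSetFuture f : futures) {
    f.getUninterruptibly(); // wait for every write to be acknowledged
}
```

Because each request carries exactly one partition key, the token-aware policy can always choose an owning replica as the coordinator, which is the hop-saving Ryan refers to.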
>>> The savings in latency are particularly huge if you use more than
>>> consistency level ONE on your write.
>>>
>>>> If this is not the case, and the clients do more work, like distributing
>>>> each insert to a different coordinator based on its partition key, then
>>>> it is understandable that a large volume of UNLOGGED BATCHes will cause
>>>> a bottleneck at the coordinator server. However, this should not be hard
>>>> to solve by distributing the insertions in one batch to different
>>>> coordinators based on their partition keys. I am curious why this is not
>>>> supported.
>>>
>>> The coordinator node does this today, of course, but that is the very
>>> bottleneck you refer to. To do what you're proposing and make it work,
>>> you'd have to enhance the CLIENT to make sure that all the objects in a
>>> batch were actually owned by the coordinator itself, and if you're
>>> talking about parsing a CQL BATCH on the client and splitting it out to
>>> the appropriate nodes in some sort of hyper token awareness, then you're
>>> taking a server-side responsibility (CQL parsing) and moving it to the
>>> client. Worse, you're asking for a number of bugs by moving CQL parsing
>>> to the client: do all clients handle this the same way? What happens to
>>> older Thrift clients with batch? Etc.
>>>
>>> Final point: every time you do a batch, you're adding extra load on the
>>> coordinator node's heap that could instead be on the client. This cannot
>>> be stated strongly enough. In production, doing large batches (say, over
>>> 5k statements) is a wonderful way to make your node spend a lot of its
>>> time handling batches and the overhead of that process.
>>>
>>>> P.S. I have tested asynchronous insertion; probably because my dataset
>>>> is small, batch insertion is always much better than async insertion. Do
>>>> you have a general idea of how large the dataset should be to reverse
>>>> this performance comparison?
>>> You could be in a situation where the node owns all the data and so can
>>> respond quickly, so it's hard to say. You can see, however, that as the
>>> cluster scales there is no way a given node will own everything in the
>>> batch unless you've designed it to be that way, either by some
>>> token-aware batch generation in the client or by only batching on the
>>> same partition key (a strategy covered in that blog post).
>>>
>>> PS: Every time I've had a customer tell me batch is faster than async,
>>> it's been a code problem, such as not storing futures for later, or in
>>> Python not using libev; in all cases I've gotten at least a 2x speedup
>>> and often way more.
>>>
>>>> - Dong
>>>>
>>>>> On Dec 1, 2014, at 9:57 AM, Ryan Svihla <rsvi...@datastax.com> wrote:
>>>>>
>>>>> So there is a bit of a misunderstanding about the role of the
>>>>> coordinator in all this. If you use an UNLOGGED BATCH and all of those
>>>>> writes are in the same partition key, then yes, it's a savings and acts
>>>>> as one mutation. If they're not, however, you're asking the coordinator
>>>>> node to do work the client could do, and you're potentially adding an
>>>>> extra network hop on several of those transactions if that coordinator
>>>>> node does not happen to own that partition key (assuming your client
>>>>> driver is using token awareness, as it is in recent versions of the
>>>>> DataStax Java Driver). This also says nothing of heap pressure, and the
>>>>> measurable effect of large batches on node performance is in practice a
>>>>> problem in production clusters.
>>>>>
>>>>> I frequently have had to switch people off using BATCH for bulk-loading
>>>>> style processes, and in _every_ single case it's been faster to use
>>>>> executeAsync... not to mention the cluster was healthier as a result.
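The "not storing futures" failure mode Ryan mentions is worth spelling out: if you fire executeAsync and discard the returned future, you never learn whether the writes landed, and any timing comparison against batch is meaningless. A minimal sketch of the correct pattern, using java.util.concurrent.CompletableFuture as a stand-in for the driver's ResultSetFuture (the executeAsync here is a fake, not the driver API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class AsyncLoadSketch {
    // Stand-in for session.executeAsync(statement): returns immediately
    // and completes on a background thread.
    static CompletableFuture<Boolean> executeAsync(int rowId) {
        return CompletableFuture.supplyAsync(() -> true);
    }

    public static void main(String[] args) {
        List<CompletableFuture<Boolean>> futures = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            futures.add(executeAsync(i)); // keep every future...
        }
        // ...and block until all writes are acknowledged before the
        // benchmark clock stops.
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        System.out.println("acknowledged=" + futures.size());
    }
}
```

Dropping the `futures.add(...)` line turns the loop into fire-and-forget, which looks fast precisely because nothing is ever waited on.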
>>>>> As for the sstableloader options: since they all use the streaming
>>>>> protocol, and as of today the streaming protocol streams one copy to
>>>>> each remote node, they tend to be slower than even executeAsync in
>>>>> multi-data-center scenarios (though in a single data center they're the
>>>>> faster option; that said, the executeAsync approach is often fast
>>>>> enough).
>>>>>
>>>>> This is all covered in a blog post,
>>>>> https://medium.com/@foundev/cassandra-batch-loading-without-the-batch-keyword-40f00e35e23e
>>>>> and the DataStax CQL docs also note that BATCH is not a performance
>>>>> optimization:
>>>>> http://www.datastax.com/documentation/cql/3.1/cql/cql_using/useBatch.html
>>>>>
>>>>> In summary, the only way an UNLOGGED BATCH is a performance improvement
>>>>> over using async with the driver is if the batch is within a certain
>>>>> reasonable size and all of its writes go to the same partition.
>>>>>
>>>>>> On Mon, Dec 1, 2014 at 9:43 AM, Dong Dai <daidon...@gmail.com> wrote:
>>>>>>
>>>>>> Thanks a lot for the reply, Raj.
>>>>>>
>>>>>> I understand they are different. But if we define a batch with
>>>>>> UNLOGGED, it will not guarantee an atomic transaction and becomes more
>>>>>> like a data import tool. To my knowledge, a BATCH statement packs
>>>>>> several mutations into one RPC to save time. Similarly, the bulk
>>>>>> loader also packs all the mutations into an SSTable file and (I think)
>>>>>> may be able to save a lot of time too.
>>>>>>
>>>>>> I am interested in whether, on the coordinator server, batch insert
>>>>>> and bulk load are similar things. I mean, are they implemented in a
>>>>>> similar way?
>>>>>>
>>>>>> P.S. I tried randomly inserting 1000 rows into a simple table on my
>>>>>> laptop as a test. Sync insert takes almost 2s to finish, but a sync
>>>>>> batch insert only takes about 900ms. That is a huge performance
>>>>>> improvement; I wonder, is this expected?
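Ryan's "same partition only" rule can be applied client-side before any UNLOGGED BATCH is built: bucket the pending rows by partition key and emit one batch per bucket. A minimal sketch with a plain String partition key (the CQL generation itself is omitted; all names are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionBuckets {
    // Bucket pending inserts by partition key so that each UNLOGGED BATCH
    // sent to the cluster touches exactly one partition.
    static Map<String, List<String>> bucketByPartition(List<String[]> rows) {
        Map<String, List<String>> buckets = new HashMap<>();
        for (String[] row : rows) {
            // row[0] = partition key, row[1] = the rest of the insert
            buckets.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[]{"sensor-1", "t=1,v=10"});
        rows.add(new String[]{"sensor-2", "t=1,v=20"});
        rows.add(new String[]{"sensor-1", "t=2,v=11"});
        // Two buckets -> two single-partition batches, not one mixed batch.
        System.out.println(bucketByPartition(rows).size());
    }
}
```

Each resulting bucket can then become one batch whose coordinator owns every row in it, which is the case where a batch really does act as a single mutation.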
>>>>>> Also, I used CQLSSTableWriter to put these 1000 insertions into a
>>>>>> single SSTable file; it took around 2s to finish on my laptop. That
>>>>>> seems pretty slow.
>>>>>>
>>>>>> Thanks!
>>>>>> - Dong
>>>>>>
>>>>>>> On Dec 1, 2014, at 2:33 AM, Rajanarayanan Thottuvaikkatumana <
>>>>>>> rnambood...@gmail.com> wrote:
>>>>>>>
>>>>>>> The BATCH statement and bulk load are totally different things. The
>>>>>>> BATCH statement belongs in the atomic-transaction space, providing a
>>>>>>> way to make more than one statement into an atomic unit, while the
>>>>>>> bulk loader provides the ability to bulk load external data into a
>>>>>>> cluster. The two are totally different things and cannot be compared.
>>>>>>>
>>>>>>> Thanks
>>>>>>> -Raj
>>>>>>>
>>>>>>>> On 01-Dec-2014, at 4:32 am, Dong Dai <daidon...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi, all,
>>>>>>>>
>>>>>>>> I have a performance question about batch insert and bulk load.
>>>>>>>>
>>>>>>>> According to the documentation, batch insert and bulk load can both
>>>>>>>> be options for importing a large volume of data into Cassandra.
>>>>>>>> Using batch insert is pretty straightforward, but there has not been
>>>>>>>> an "official" way to use bulk load to import data (in this case, I
>>>>>>>> mean data that is generated online).
>>>>>>>>
>>>>>>>> So I am thinking clients first use CQLSSTableWriter to create the
>>>>>>>> SSTable files, then use "org.apache.cassandra.tools.BulkLoader" to
>>>>>>>> import these SSTables into Cassandra directly.
>>>>>>>>
>>>>>>>> The question is: can I expect better performance using the
>>>>>>>> BulkLoader this way compared with using batch insert?
>>>>>>>>
>>>>>>>> I am not so familiar with the implementation of bulk load, but I do
>>>>>>>> see a huge performance improvement using batch insert, and I really
>>>>>>>> want to know the upper limits of the write performance. Any comment
>>>>>>>> will be helpful. Thanks!
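For reference, the CQLSSTableWriter-then-BulkLoader pipeline Dong describes looks roughly like this against the 2014-era Cassandra 2.x API. This is a sketch under assumptions (the schema and output directory are made up), it needs the cassandra-all jar on the classpath, and it is not runnable standalone:

```java
// Writes rows into SSTable files on local disk; the resulting directory
// is then fed to org.apache.cassandra.tools.BulkLoader (sstableloader).
CQLSSTableWriter writer = CQLSSTableWriter.builder()
        .inDirectory("/tmp/demo_ks/events")
        .forTable("CREATE TABLE demo_ks.events (id uuid PRIMARY KEY, payload text)")
        .using("INSERT INTO demo_ks.events (id, payload) VALUES (?, ?)")
        .build();
for (int i = 0; i < 1000; i++) {
    writer.addRow(UUID.randomUUID(), "row-" + i);
}
writer.close();
```

Note that the ~2s Dong measured here includes local SSTable serialization only; the streaming cost of BulkLoader, which is what the replies above compare against executeAsync, comes on top of it.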
>>>>>>>> - Dong
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Ryan Svihla
>>>>>
>>>>> Solution Architect
>>>>>
>>>>> DataStax <http://www.datastax.com/> | Twitter
>>>>> <https://twitter.com/foundev> | LinkedIn
>>>>> <http://www.linkedin.com/pub/ryan-svihla/12/621/727/>
>>>>>
>>>>> DataStax is the fastest, most scalable distributed database technology,
>>>>> delivering Apache Cassandra to the world's most innovative enterprises.
>>>>> DataStax is built to be agile, always-on, and predictably scalable to
>>>>> any size. With more than 500 customers in 45 countries, DataStax is the
>>>>> database technology and transactional backbone of choice for the
>>>>> world's most innovative companies such as Netflix, Adobe, Intuit, and
>>>>> eBay.