Re: Performance Difference between Batch Insert and Bulk Load

2014-12-05 Thread Dong Dai
Err, am I misunderstanding something?
I thought Tyler was going to add some code to split unlogged batches and make
the batch insertion token aware.

Is it already done? Otherwise I can do it too.

thanks,
- Dong

> On Dec 5, 2014, at 2:06 PM, Philip Thompson wrote:
> 
> What progress are you trying to be aware of? All of the features Tyler 
> discussed are implemented and can be used.
> 
> On Fri, Dec 5, 2014 at 2:41 PM, Dong Dai <daidon...@gmail.com> wrote:
> 
>> On Dec 5, 2014, at 11:23 AM, Tyler Hobbs <ty...@datastax.com> wrote:
>> 
>> 
>> On Fri, Dec 5, 2014 at 1:15 AM, Dong Dai <daidon...@gmail.com> wrote:
>> Sounds great! By the way, will you create a ticket for this, so we can 
>> follow the updates?
>> 
>> What would the ticket be for?  (I might have missed something in the 
>> conversation.)
>> 
> 
> Sorry, there isn't a ticket then. I just want a way to keep track of the
> progress. :)
> 
> - Dong
> 
>> 
>> -- 
>> Tyler Hobbs
>> DataStax <http://datastax.com/>
> 
> 



Re: Performance Difference between Batch Insert and Bulk Load

2014-12-05 Thread Dong Dai

> On Dec 5, 2014, at 11:23 AM, Tyler Hobbs wrote:
> 
> 
> On Fri, Dec 5, 2014 at 1:15 AM, Dong Dai <daidon...@gmail.com> wrote:
> Sounds great! By the way, will you create a ticket for this, so we can follow 
> the updates?
> 
> What would the ticket be for?  (I might have missed something in the 
> conversation.)
> 

Sorry, there isn't a ticket then. I just want a way to keep track of the
progress. :)

- Dong

> 
> -- 
> Tyler Hobbs
> DataStax <http://datastax.com/>



Re: Performance Difference between Batch Insert and Bulk Load

2014-12-04 Thread Dong Dai

> On Dec 4, 2014, at 1:46 PM, Tyler Hobbs wrote:
> 
> 
> On Thu, Dec 4, 2014 at 11:50 AM, Dong Dai <daidon...@gmail.com> wrote:
> Since the client already does what coordinators do, why don't we go one step
> further: break the UNLOGGED batch statement into several small batch
> statements, each containing the statements with the same partition key, and
> send them to different coordinators based on TokenAwarePolicy? This would
> save a lot of round trips, right?
> 
> The reason I ask is that I have a use case where importing huge amounts of
> data into Cassandra is very common, and these imports do not need to be
> atomic.
> 
> Yes, what you suggest is basically ideal.  I would do exactly that.
> 

Sounds great! By the way, will you create a ticket for this, so we can follow 
the updates?

thanks,
- Dong

> 
> -- 
> Tyler Hobbs
> DataStax <http://datastax.com/>



Re: Performance Difference between Batch Insert and Bulk Load

2014-12-04 Thread Dong Dai

> On Dec 4, 2014, at 11:37 AM, Tyler Hobbs wrote:
> 
> 
> On Wed, Dec 3, 2014 at 11:02 PM, Dong Dai <daidon...@gmail.com> wrote:
> 
> 1) unless I am using TokenAwarePolicy, async inserts will not be sent to
> the right coordinator.
> 
> Yes.  Of course, TokenAwarePolicy can wrap any other policy.
>  
> 
> 2) TokenAwarePolicy is actually doing the job that coordinators do:
> calculating the data placement from the keyspace and partition key.
> 
> That's correct, it does the same calculation that the coordinator does.
> 

Thanks for the clarification. This leads back to my previous discussion with
Ryan. Since the client already does what coordinators do, why don't we go one
step further: break the UNLOGGED batch statement into several small batch
statements, each containing the statements with the same partition key, and
send them to different coordinators based on TokenAwarePolicy? This would save
a lot of round trips, right?

The reason I ask is that I have a use case where importing huge amounts of
data into Cassandra is very common, and these imports do not need to be atomic.

thanks,
- Dong

> 
> -- 
> Tyler Hobbs
> DataStax <http://datastax.com/>



Re: Performance Difference between Batch Insert and Bulk Load

2014-12-03 Thread Dong Dai
Thanks a lot for the great answers. P.S. I moved this thread here from dev.

By checking the source code of the java-driver, I noticed that the execute()
method is implemented using executeAsync() with an immediate get:

    @Override
    public ResultSet execute(Statement statement) {
        return executeAsync(statement).getUninterruptibly();
    }
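For comparison, the win from executeAsync() comes from firing many writes before waiting, instead of get()-ing each one immediately (which degenerates into execute()). Below is a self-contained sketch of that pattern using the JDK's CompletableFuture; fakeInsert is a made-up stand-in for session.executeAsync(statement), not a driver API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Sketch of the "store futures, wait at the end" pattern behind executeAsync.
// fakeInsert stands in for session.executeAsync(statement) so the example
// runs without a cluster; only the shape of the pattern matters here.
public class AsyncInsertPattern {

    public static CompletableFuture<Integer> fakeInsert(int row) {
        // Pretend this is a non-blocking network write that completes later.
        return CompletableFuture.supplyAsync(() -> row);
    }

    public static int insertAll(int n) {
        List<CompletableFuture<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            futures.add(fakeInsert(i));   // fire off every write first...
        }
        int completed = 0;
        for (CompletableFuture<Integer> f : futures) {
            f.join();                     // ...then wait for them all together
            completed++;
        }
        return completed;
    }

    public static void main(String[] args) {
        System.out.println(insertAll(100)); // prints 100
    }
}
```

Calling join() (or get()) right after each submit instead would serialize the writes, one round trip per row, which is the anti-pattern mentioned later in this thread.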

After checking the different LoadBalancingPolicy implementations, it seems that
only TokenAwarePolicy prefers the server holding a local replica. Other
policies, like RoundRobinPolicy, simply distribute each request to the next
server in rotation.

So, does this mean: 

1) unless I am using TokenAwarePolicy, async inserts will not be sent to the
right coordinator.

2) TokenAwarePolicy is actually doing the job that coordinators do: calculating
the data placement from the keyspace and partition key.

thanks,
- Dong

> On Dec 2, 2014, at 9:13 AM, Ryan Svihla wrote:
> 
> On Mon, Dec 1, 2014 at 1:52 PM, Dong Dai <daidon...@gmail.com> wrote:
> 
>> Thanks Ryan, and also thanks for your great blog post.
>> 
>> However, this makes me more confused. Mainly about the coordinators.
>> 
>> Based on my understanding, whether it is a batch insertion, an ordinary sync
>> insert, or an async insert, the coordinator is only selected once for the
>> whole session by calling cluster.connect(), and after that, all the
>> insertions go through that coordinator.
>> 
> 
> That's all correct but what you're not accounting for is if you use a token
> aware client then the coordinator will likely not own all the data in a
> batch, ESPECIALLY as you scale up to more nodes. If you are using
> executeAsync and a single row then the coordinator node will always be an
> owner of the data, thereby minimizing network hops. Some people now stop me
> and say "but the client is making those hops!", and that's when I point out
> "what do you think the coordinator has to do", only you've introduced
> something in the middle, and prevented token awareness from doing its job.
> The savings in latency are particularly huge if you use more than a
> consistency level one on your write.
> 
> 
>> If this is not the case, and the clients do more work, like distributing
>> each insert to a different coordinator based on its partition key, then it
>> is understandable that a large UNLOGGED BATCH will cause a bottleneck at
>> the coordinator. However, this should not be hard to solve by distributing
>> the insertions in one batch to different coordinators based on their
>> partition keys. I am curious why this is not supported.
>> 
> 
> The coordinator node does this of course today, but this is the very
> bottleneck of which you refer. To do what you're wanting to do and make it
> work, you'd have to enhance the CLIENT to make sure that all the objects in
> that batch were actually owned by the coordinator itself, and if you're
> talking about parsing a CQL BATCH on the client and splitting it out to the
> appropriate nodes in some sort of hyper token awareness, then you're taking
> a server side responsibility (CQL parsing) and moving it to the client.
> Worse, you're asking for a number of bugs to occur by moving CQL parsing to
> the client: do all clients handle this the same way? What happens to older
> Thrift clients with batch? Etc.
> 
> Final point, every time you do a batch you're adding extra load to the
> coordinator node's heap that could instead be on the client. This cannot be
> stated strongly enough. In production, doing large batches (say over 5k) is
> a wonderful way to make your node spend a lot of its time handling batches
> and the overhead of that process.
> 
>> 
>> P.S. I have tested asynchronous insertion; probably because my dataset is
>> small, batch insertion is always much better than async insertion. Do you
>> have a general idea of how large the dataset would need to be to reverse
>> this comparison?
>> 
> 
> You could be in a situation where the node owns all the data, and so can
> respond quickly, so it's hard to say. You can see, however, that as the
> cluster scales there is no way a given node will own everything in the batch
> unless you've designed it to be that way, either by some token-aware batch
> generation in the client or by only batching on the same partition key
> (a strategy covered in that blog).
> 
> PS Every time I've had a customer tell me batch is faster than async, it's
> been a code problem such as not storing futures for later, or in Python not
> using libev. In all cases I've gotten a

Re: Performance Difference between Batch Insert and Bulk Load

2014-12-01 Thread Dong Dai
Thanks Rob, 

I guess you mean that BulkLoader streams whole SSTables to the remote servers,
so it is faster?

The documentation says that all the rows in the SSTable will be inserted into
the new cluster conforming to the replication strategy of that cluster. This
gave me the feeling that BulkLoader works by calling inserts after the data is
transmitted to the coordinators.

I asked because I have tried batch insertion. It is so fast that it makes me
think BulkLoader cannot beat it.

thanks,
- Dong

> On Dec 1, 2014, at 1:37 PM, Robert Coli wrote:
> 
> On Sun, Nov 30, 2014 at 8:44 PM, Dong Dai <daidon...@gmail.com> wrote:
> The question is: can I expect better performance using BulkLoader this way
> compared with batch insert?
> 
> You just asked if writing once (via streaming) is likely to be significantly 
> more efficient than writing twice (once to the commit log, and then once at 
> flush time).
> 
> Yes.
> 
> =Rob
>  



Performance Difference between Batch Insert and Bulk Load

2014-11-30 Thread Dong Dai
Hi, all, 

I have a performance question about batch insert and bulk load.

According to the documentation, batch insert and bulk load can both be options
for importing a large volume of data into Cassandra. Using batch insert is
pretty straightforward, but there has not been an "official" way to use bulk
load to import data (in this case, I mean data that was generated online).

So, I am thinking that clients first use CQLSSTableWriter to create the SSTable
files, then use "org.apache.cassandra.tools.BulkLoader" to import these
SSTables into Cassandra directly.
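The two-step workflow might look roughly like this. The schema, paths, and names are invented for illustration; CQLSSTableWriter lives in the cassandra-all artifact and its exact builder signatures varied across versions, so treat this as a sketch rather than copy-paste code.

```java
import org.apache.cassandra.io.sstable.CQLSSTableWriter;

// Step 1: write SSTables offline into a <keyspace>/<table> directory.
public class OfflineWriter {
    public static void main(String[] args) throws Exception {
        CQLSSTableWriter writer = CQLSSTableWriter.builder()
            .inDirectory("data/myks/mytable")   // directory must already exist
            .forTable("CREATE TABLE myks.mytable (k text PRIMARY KEY, v int)")
            .using("INSERT INTO myks.mytable (k, v) VALUES (?, ?)")
            .build();
        writer.addRow("key1", 1);               // one call per generated row
        writer.close();
        // Step 2 (from a shell): stream the result into the cluster, e.g.
        //   sstableloader -d <contact-point> data/myks/mytable
    }
}
```

Because the rows are sorted and serialized on the client, the cluster only has to stream the finished SSTables to the owning replicas instead of writing each row through the commit log and memtable path.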

The question is: can I expect better performance using BulkLoader this way
compared with batch insert?

I am not so familiar with the implementation of bulk load, but I do see a huge
performance improvement using batch insert. I really want to know the upper
limit of write performance. Any comments will be helpful. Thanks!

- Dong