Sorry Jack for my poor description: to make my life easier, I write the same 1 MB array 600 times, which lets me simulate a 600 MB file. It's just a simplification. Instead of generating a 600 MB random array (or reading a real 600 MB file) and dividing it into 600 chunks, I write the same random 1 MB array 600 times. Every chunk corresponds to the data field in the table. I realize the blob parameter of the write method can lead to confusion (I'm going to update it on GitHub at least).
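To make the simplification concrete, here is a rough sketch (not code from the repo; the names chunkSize, numberOfChunks and the file-reading lines are purely illustrative):

val chunkSize = 1000000        // ~1 MB per chunk, i.e. one "data" cell
val numberOfChunks = 600       // 600 chunks simulate a ~600 MB file

// What the POC does: one random 1 MB array, written numberOfChunks times.
val chunk = new Array[Byte](chunkSize)
scala.util.Random.nextBytes(chunk)
// (0 until numberOfChunks).foreach(_ => insertChunk(chunk))  // insertChunk is hypothetical

// What a real upload would do instead: slice the actual file into 1 MB chunks.
// val fileBytes = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get("file.bin"))
// val chunks: Iterator[Array[Byte]] = fileBytes.grouped(chunkSize)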
I think that the content of the file is not important for the test itself, I just need 1 MB of data to be written. Let me know if there are other unclear spots.

giampaolo

2016-02-09 1:28 GMT+01:00 Jack Krupansky <jack.krupan...@gmail.com>:

> I'm a little lost now. Where are you specifying chunk size, which is what
> should be varying, as opposed to blob size? And what exactly is the number
> of records? Seems like you should be computing the number of chunks from
> blob size divided by chunk size. And it still seems like you are writing
> the same data for each chunk.
>
> -- Jack Krupansky
>
> On Mon, Feb 8, 2016 at 5:34 PM, Giampaolo Trapasso <
> giampaolo.trapa...@radicalbit.io> wrote:
>
>> I write at every step MyConfig.blobSize bytes, which I configured to be
>> between 100000 and 1000000. This allows me to "simulate" the writing of a
>> 600 MB file, as per the configuration on GitHub (
>> https://github.com/giampaolotrapasso/cassandratest/blob/master/src/main/resources/application.conf
>> )
>>
>> Giampaolo
>>
>> 2016-02-08 23:25 GMT+01:00 Jack Krupansky <jack.krupan...@gmail.com>:
>>
>>> You appear to be writing the entire blob on each chunk rather than a
>>> slice of the blob.
>>>
>>> -- Jack Krupansky
>>>
>>> On Mon, Feb 8, 2016 at 1:45 PM, Giampaolo Trapasso <
>>> giampaolo.trapa...@radicalbit.io> wrote:
>>>
>>>> Hi to all,
>>>>
>>>> I'm trying to put a large binary file (> 500 MB) on a C* cluster as
>>>> fast as I can, but I get some (many) WriteTimeoutExceptions.
>>>>
>>>> I created a small POC that isolates the problem I'm facing. Here you
>>>> will find the code: https://github.com/giampaolotrapasso/cassandratest
>>>>
>>>> *Main details about it:*
>>>>
>>>> - I write the file in chunks (the *data* field) of <= 1 MB (1 MB is
>>>> the recommended max size for a single cell).
>>>> - Chunks are grouped into buckets. Every bucket is a partition row.
>>>> - Buckets are grouped by UUIDs.
>>>> - Chunk size and bucket size are configurable from the app, so I can
>>>> try different configurations and see what happens.
>>>> - To maximize throughput, I execute asynchronous insertions; however,
>>>> to avoid too much pressure on the db, after a threshold I wait for at
>>>> least one insert to finish before adding another (this part is quite raw
>>>> in my code, but I think it's not so important). This parameter is also
>>>> configurable to test different combinations.
>>>>
>>>> This is the table on the db:
>>>>
>>>> CREATE TABLE blobtest.store (
>>>>     uuid uuid,
>>>>     bucket bigint,
>>>>     start bigint,
>>>>     data blob,
>>>>     end bigint,
>>>>     PRIMARY KEY ((uuid, bucket), start)
>>>> )
>>>>
>>>> and this is the main code (Scala, but I hope it is generally readable):
>>>>
>>>> val statement = client.session.prepare(
>>>>   "INSERT INTO blobTest.store(uuid, bucket, start, end, data) VALUES (?, ?, ?, ?, ?) if not exists;")
>>>>
>>>> val blob = new Array[Byte](MyConfig.blobSize)
>>>> scala.util.Random.nextBytes(blob)
>>>>
>>>> write(client,
>>>>   numberOfRecords = MyConfig.recordNumber,
>>>>   bucketSize = MyConfig.bucketSize,
>>>>   maxConcurrentWrites = MyConfig.maxFutures,
>>>>   blob,
>>>>   statement)
>>>>
>>>> where write is
>>>>
>>>> def write(database: Database, numberOfRecords: Int, bucketSize: Int,
>>>>           maxConcurrentWrites: Int,
>>>>           blob: Array[Byte], statement: PreparedStatement): Unit = {
>>>>
>>>>   val uuid: UUID = UUID.randomUUID()
>>>>   var count = 0
>>>>
>>>>   // Javish loop
>>>>   while (count < numberOfRecords) {
>>>>     val record = Record(
>>>>       uuid = uuid,
>>>>       bucket = count / bucketSize,
>>>>       start = (count % bucketSize) * blob.length,
>>>>       end = ((count % bucketSize) + 1) * blob.length,
>>>>       bytes = blob
>>>>     )
>>>>     asynchWrite(database, maxConcurrentWrites, statement, record)
>>>>     count += 1
>>>>   }
>>>>
>>>>   waitDbWrites()
>>>> }
>>>>
>>>> and asynchWrite just binds the record to the statement.
>>>>
>>>> *Problem*
>>>>
>>>> The problem is that when I try to increase the chunk size, the number
>>>> of async inserts, or the size of the bucket (i.e. the number of chunks),
>>>> the app becomes unstable since the db starts throwing
>>>> WriteTimeoutException.
>>>>
>>>> I've tested this on CCM (4 nodes) and on an EC2 cluster (5 nodes, 8 GB
>>>> heap). The problem seems the same in both environments.
>>>>
>>>> On my local cluster, I've changed the following with respect to the
>>>> default configuration:
>>>>
>>>> concurrent_writes: 128
>>>> write_request_timeout_in_ms: 200000
>>>>
>>>> The other configuration is here:
>>>> https://gist.github.com/giampaolotrapasso/ca21a83befd339075e07
>>>>
>>>> *Other*
>>>>
>>>> The exceptions seem random; sometimes they occur at the beginning of
>>>> the write.
>>>>
>>>> *Questions:*
>>>>
>>>> 1. Is my model wrong? Am I missing some important detail?
>>>> 2. What is the important information to look at for this kind of
>>>> problem?
>>>> 3. Why are the exceptions so random?
>>>> 4. Is there some other C* parameter I can set to ensure that
>>>> WriteTimeoutException does not occur?
>>>>
>>>> I hope I provided enough information to get some help.
>>>>
>>>> Thank you in advance for any reply.
>>>>
>>>> Giampaolo