Sorry Jack for my poor description.
To make my life easier, I write the same 1MB array of bytes 600 times.
This allows me to simulate a 600MB file; it's just a simplification.
Instead of generating a 600MB random array (or reading a real 600MB file)
and dividing it into 600 chunks, I write the same random array 600 times.
Every chunk corresponds to the data field in the table. I realize that the
blob parameter of the write method can lead to confusion (I'm going to
update it on GitHub, at least).
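
To make this concrete, the simulation boils down to something like the
sketch below (illustration only, not the actual test code; blobSize and
recordNumber mirror the names used in MyConfig):

    // a ~600MB "file" simulated as 600 copies of the same 1MB random chunk
    val blobSize = 1000000                  // chunk size in bytes
    val recordNumber = 600                  // number of chunks, ~600MB in total
    val chunk = new Array[Byte](blobSize)
    scala.util.Random.nextBytes(chunk)

    // every "slice" of the simulated file is just the same array
    val slices: Iterator[Array[Byte]] = Iterator.fill(recordNumber)(chunk)

Only one chunk's worth of random bytes is ever generated, but the total
amount written is still recordNumber * blobSize, i.e. about 600MB.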

I think that the content of the file is not important for the test itself;
I just need 1MB of data to be written. Let me know if there are any other
unclear spots.

giampaolo


2016-02-09 1:28 GMT+01:00 Jack Krupansky <jack.krupan...@gmail.com>:

> I'm a little lost now. Where are you specifying chunk size, which is what
> should be varying, as opposed to blob size? And what exactly is the number
> of records? It seems like you should be computing the number of chunks from
> the blob size divided by the chunk size. And it still seems like you are
> writing the same data for each chunk.
>
> -- Jack Krupansky
>
> On Mon, Feb 8, 2016 at 5:34 PM, Giampaolo Trapasso <
> giampaolo.trapa...@radicalbit.io> wrote:
>
>> At every step I write MyConfig.blobSize bytes, which I configured to range
>> from 100000 to 1000000. This allows me to "simulate" the writing of a
>> 600MB file, as per the configuration on GitHub (
>> https://github.com/giampaolotrapasso/cassandratest/blob/master/src/main/resources/application.conf
>> ).
>>  Giampaolo
>>
>> 2016-02-08 23:25 GMT+01:00 Jack Krupansky <jack.krupan...@gmail.com>:
>>
>>> You appear to be writing the entire blob on each chunk rather than the
>>> slice of the blob.
>>>
>>> -- Jack Krupansky
>>>
>>> On Mon, Feb 8, 2016 at 1:45 PM, Giampaolo Trapasso <
>>> giampaolo.trapa...@radicalbit.io> wrote:
>>>
>>>> Hi to all,
>>>>
>>>> I'm trying to put a large binary file (> 500MB) on a C* cluster as fast
>>>> as I can but I get some (many) WriteTimeoutExceptions.
>>>>
>>>> I created a small POC that isolates the problem I'm facing. You will
>>>> find the code here: https://github.com/giampaolotrapasso/cassandratest
>>>>
>>>>
>>>> *Main details about it:*
>>>>
>>>>    - I write the file in chunks (the *data* field), each <= 1MB (1MB is
>>>>    the recommended max size for a single cell),
>>>>
>>>>
>>>>    - Chunks are grouped into buckets; every bucket is a separate partition,
>>>>    - Buckets are grouped by UUIDs.
>>>>
>>>>
>>>>    - Chunk size and bucket size are configurable from the app, so I can
>>>>    try different configurations and see what happens.
>>>>
>>>>
>>>>    - Trying to maximize throughput, I execute asynchronous insertions;
>>>>    however, to avoid too much pressure on the db, after a threshold I
>>>>    wait for at least one insert to finish before adding another (this
>>>>    part is quite raw in my code, but I think it's not so important).
>>>>    This parameter is also configurable to test different combinations.
>>>>
>>>> This is the table on the db:
>>>>
>>>> CREATE TABLE blobtest.store (
>>>>     uuid uuid,
>>>>     bucket bigint,
>>>>     start bigint,
>>>>     data blob,
>>>>     end bigint,
>>>>     PRIMARY KEY ((uuid, bucket), start)
>>>> )
>>>>
>>>> and this is the main code (Scala, but I hope it is generally readable):
>>>>
>>>>     val statement = client.session.prepare(
>>>>       "INSERT INTO blobTest.store(uuid, bucket, start, end, data) VALUES (?, ?, ?, ?, ?) if not exists;")
>>>>
>>>>     val blob = new Array[Byte](MyConfig.blobSize)
>>>>     scala.util.Random.nextBytes(blob)
>>>>
>>>>     write(client,
>>>>       numberOfRecords = MyConfig.recordNumber,
>>>>       bucketSize = MyConfig.bucketSize,
>>>>       maxConcurrentWrites = MyConfig.maxFutures,
>>>>       blob = blob,
>>>>       statement = statement)
>>>>
>>>> where write is
>>>>
>>>> def write(database: Database, numberOfRecords: Int, bucketSize: Int,
>>>>           maxConcurrentWrites: Int, blob: Array[Byte],
>>>>           statement: PreparedStatement): Unit = {
>>>>
>>>>     val uuid: UUID = UUID.randomUUID()
>>>>     var count = 0;
>>>>
>>>>     //Javish loop
>>>>     while (count < numberOfRecords) {
>>>>       val record = Record(
>>>>         uuid = uuid,
>>>>         bucket = count / bucketSize,
>>>>         start = ((count % bucketSize)) * blob.length,
>>>>         end = ((count % bucketSize) + 1) * blob.length,
>>>>         bytes = blob
>>>>       )
>>>>       asynchWrite(database, maxConcurrentWrites, statement, record)
>>>>       count += 1
>>>>     }
>>>>
>>>>     waitDbWrites()
>>>>   }
>>>>
>>>> and asynchWrite just binds the record to the statement and executes it
>>>> asynchronously.
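>>>>
>>>> Roughly, the shape is like the sketch below. This is only a guess at the
>>>> actual code in the repo: it assumes the DataStax Java driver's
>>>> executeAsync plus a simple queue of pending futures to implement the
>>>> "wait for a finished insert" throttle described above, and it assumes
>>>> database.session exposes the driver session, like client.session does.
>>>>
>>>> import java.nio.ByteBuffer
>>>> import scala.collection.mutable
>>>> import com.datastax.driver.core.{PreparedStatement, ResultSetFuture}
>>>>
>>>> // pending async writes, oldest first (sketch only)
>>>> val pending = mutable.Queue.empty[ResultSetFuture]
>>>>
>>>> def asynchWrite(database: Database, maxConcurrentWrites: Int,
>>>>                 statement: PreparedStatement, record: Record): Unit = {
>>>>   // bind in the same order as the prepared INSERT:
>>>>   // (uuid, bucket, start, end, data)
>>>>   val bound = statement.bind(
>>>>     record.uuid,
>>>>     record.bucket.toLong: java.lang.Long,
>>>>     record.start.toLong: java.lang.Long,
>>>>     record.end.toLong: java.lang.Long,
>>>>     ByteBuffer.wrap(record.bytes))
>>>>   pending.enqueue(database.session.executeAsync(bound))
>>>>   // crude back-pressure: once maxConcurrentWrites futures are in
>>>>   // flight, block on the oldest before issuing more
>>>>   if (pending.size >= maxConcurrentWrites)
>>>>     pending.dequeue().getUninterruptibly()
>>>> }
>>>>
>>>> def waitDbWrites(): Unit =
>>>>   while (pending.nonEmpty) pending.dequeue().getUninterruptibly()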
>>>>
>>>> *Problem*
>>>>
>>>> The problem is that when I try to increase the chunk size, the number of
>>>> async inserts, or the size of the bucket (i.e., the number of chunks),
>>>> the app becomes unstable because the db starts throwing
>>>> WriteTimeoutExceptions.
>>>>
>>>> I've tested this on CCM (4 nodes) and on an EC2 cluster (5 nodes, 8GB
>>>> heap). The problem seems the same in both environments.
>>>>
>>>> On my local cluster, I've tried changing the following settings from the
>>>> default configuration:
>>>>
>>>> concurrent_writes: 128
>>>>
>>>> write_request_timeout_in_ms: 200000
>>>>
>>>> The other configuration settings are here:
>>>> https://gist.github.com/giampaolotrapasso/ca21a83befd339075e07
>>>>
>>>> *Other*
>>>>
>>>> The exceptions seem random; sometimes they occur right at the beginning
>>>> of the write.
>>>>
>>>> *Questions:*
>>>>
>>>> 1. Is my model wrong? Am I missing some important detail?
>>>>
>>>> 2. What is the important information to look at for this kind of
>>>> problem?
>>>>
>>>> 3. Why are the exceptions so random?
>>>>
>>>> 4. Is there some other C* parameter I can set to ensure that
>>>> WriteTimeoutException does not occur?
>>>>
>>>> I hope I provided enough information to get some help.
>>>>
>>>> Thank you in advance for any reply.
>>>>
>>>>
>>>> Giampaolo
>>>>
>>>
>>
>
