Data model for streaming a large table in real time.

2014-06-06 Thread Kevin Burton
We have a requirement that clients be able to read from our tables while
they're being written.

Basically, any write that we make to cassandra needs to be sent out over
the Internet to our customers.

We also need them to be able to resume, so if they go offline they can just
pick up where they left off.

They need to do this in parallel, so if we have 20 cassandra nodes, they
can have 20 readers each efficiently (and without coordination) reading
from our tables.

Here's how we're planning on doing it.

We're going to use the ByteOrderedPartitioner.

I'm writing with a primary key of the timestamp; however, in practice, this
would yield hotspots.

(I'm also aware that time isn't a very good pk in a distributed system, as I
can easily have a collision, so we're going to use a scheme similar to a
uuid to make it unique per writer.)

One node would take all the load, followed by the next node, etc.

So my plan to prevent this is to prefix a slice ID to the timestamp.  This way
each piece of content has a unique ID, but the prefix determines which node it
lands on.

The slice ID is just a byte… so this means there are 256 buckets in which I
can place data.

This means I can have clients each start with a slice, and a timestamp, and
page through the data with tokens.

This way I can have a client reading with 256 threads from 256 regions of
the cluster, in parallel, without any hot spots.
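
A minimal CQL sketch of what I have in mind (names are illustrative, and the
slice is an int since CQL has no single-byte type):

create table feed (
    slice int,       -- the bucket prefix; under ByteOrderedPartitioner this
                     -- determines where the row lands in the ring
    ts timeuuid,     -- timestamp made unique per writer
    data blob,
    primary key ((slice, ts))
);

-- a reader that owns slice 17 resumes from its saved position:
select * from feed
where token(slice, ts) > token(17, last_seen_ts)
limit 1000;

(last_seen_ts is a placeholder for the reader's saved timeuuid.)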

Thoughts on this strategy?

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile


War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Re: VPC AWS

2014-06-06 Thread Jonathan Haddad
This may not help you with the migration, but it may help with maintenance &
management.  I just put up a blog post on managing VPC security groups with
a tool I open sourced at my previous company.  If you're going to have
different VPCs (staging / prod), it might help with managing security
groups.

http://rustyrazorblade.com/2014/06/an-introduction-to-roadhouse/

Semi shameless plug... but relevant.


On Thu, Jun 5, 2014 at 12:01 PM, Aiman Parvaiz  wrote:

> Cool, thanks again for this.
>
>
> On Thu, Jun 5, 2014 at 11:51 AM, Michael Theroux 
> wrote:
>
>> You can have a ring spread across EC2 and the public subnet of a VPC.
>>  That is how we did our migration.  In our case, we simply replaced the
>> existing EC2 node with a new instance in the public VPC, restored from a
>> backup taken right before the switch.
>>
>> -Mike
>>
>>   --
>>  *From:* Aiman Parvaiz 
>> *To:* Michael Theroux 
>> *Cc:* "user@cassandra.apache.org" 
>> *Sent:* Thursday, June 5, 2014 2:39 PM
>> *Subject:* Re: VPC AWS
>>
>> Thanks for this info Michael. As far as restoring nodes in the public VPC
>> is concerned, I was thinking (and I might be wrong here) that we can have a
>> ring spread across EC2 and the public subnet of a VPC. This way I can simply
>> decommission nodes in EC2 as I gradually introduce new nodes in the public
>> subnet of the VPC; I will end up with a ring in the public subnet and can
>> then migrate the nodes from public to private in a similar way.
>>
>> If anyone has any experience/ suggestions with this please share, would
>> really appreciate it.
>>
>> Aiman
>>
>>
>> On Thu, Jun 5, 2014 at 10:37 AM, Michael Theroux 
>> wrote:
>>
>> The implementation of moving from EC2 to a VPC was a bit of a juggling
>> act.  Our motivation was twofold:
>>
>> 1) We were running out of static IP addresses, and it was becoming
>> increasingly difficult in EC2 to keep the number of static IP addresses we
>> needed within the number of public IP addresses EC2 allowed.
>> 2) VPC affords us an additional level of security that was desirable.
>>
>>  However, we needed to consider the following limitations:
>>
>>  1) By default, you have a limited number of available public IPs for
>> both EC2 and VPC.
>> 2) AWS security groups need to be configured to allow traffic for
>> Cassandra to/from instances in EC2 and the VPC.
>>
>>  You are correct at the high level that the migration goes from
>> EC2->Public VPC (VPC with an Internet Gateway)->Private VPC (VPC with a
>> NAT).  The first phase was moving instances to the public VPC, setting
>> broadcast and seeds to the public IPs we had available.  Basically:
>>
>> 1) Take down a node, taking a snapshot for a backup
>> 2) Restore the node on the public VPC, assigning it to the correct
>> security group, manually setting the seeds to other available nodes
>> 3) Verify the cluster can communicate
>> 4) Repeat
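
For step 2, the relevant cassandra.yaml settings look roughly like this
(addresses and seed lists here are placeholders, not the actual values used):

listen_address: 10.0.1.15           # private address inside the VPC subnet
broadcast_address: 203.0.113.15     # public IP, so nodes still in EC2 can reach this node
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "203.0.113.10,203.0.113.11"  # public IPs of already-migrated nodes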
>>
>> Realize the NAT instance on the private subnet will also require a public
>> IP.  What got really interesting is that near the end of the process we
>> ran out of available IPs, requiring us to switch the final node that was on
>> EC2 directly to the private VPC (and taking down two nodes at once, which
>> our setup allowed given we had 6 nodes with an RF of 3).
>>
>> What we did, and highly suggest for the switch, is to write down every
>> step that has to happen on every node during the switch.  In our case, many
>> of the moved nodes required slightly different configurations for items
>> like the seeds.
>>
>> It's been a couple of years, so my memory on this may be a little fuzzy :)
>>
>> -Mike
>>
>>   --
>>  *From:* Aiman Parvaiz 
>> *To:* user@cassandra.apache.org; Michael Theroux 
>> *Sent:* Thursday, June 5, 2014 12:55 PM
>> *Subject:* Re: VPC AWS
>>
>> Michael,
>> Thanks for the response; I am about to head into something very similar,
>> if not exactly the same. I envision things happening along the same lines
>> as you mentioned.
>> I would be grateful if you could please throw some more light on how you
>> went about switching Cassandra nodes from the public subnet to private
>> without any downtime.
>> I have not started on this project yet; I am still in my research phase. I
>> plan to have an EC2 + public-VPC cluster, then decommission the EC2 nodes
>> to have everything in the public subnet; next would be to move it to the
>> private subnet.
>>
>> Thanks
>>
>>
>> On Thu, Jun 5, 2014 at 8:14 AM, Michael Theroux 
>> wrote:
>>
>> We personally use the EC2Snitch; however, we don't have the multi-region
>> requirements you do.
>>
>> -Mike
>>
>>   --
>>  *From:* Alain RODRIGUEZ 
>> *To:* user@cassandra.apache.org
>> *Sent:* Thursday, June 5, 2014 9:14 AM
>> *Subject:* Re: VPC AWS
>>
>> I think you can define VPC subnet to be public (to have public + private
>> IPs) or private only.
>>
>> Any insight regarding snitches ? What snitch do you guys use ?
>>
>>
>> 2014-06-05 15:06 GMT+02:00 William Oberman :
>>
>> I don't think traffic will 

Re: Bad Request: Type error: cannot assign result of function token (type bigint) to id (type int)

2014-06-06 Thread Kevin Burton
Thanks!! Yes. I completely missed that.  Not sure why… :)

Appreciate the help!
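
For anyone who hits the same error: wrap the partition key in token() on both
sides of the comparison. Against the test_paging table in the quoted thread
below, the paging pattern looks like this (the LIMIT is just illustrative):

select * from test_paging limit 2;                             -- first page
select * from test_paging where token(id) > token(2) limit 2;  -- resume from last id seen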


On Fri, Jun 6, 2014 at 2:59 AM, Laing, Michael 
wrote:

> select * from test_paging where token(id) > token(0);
>
> ml
>
>
> On Fri, Jun 6, 2014 at 1:47 AM, Jonathan Haddad  wrote:
>
>> Sorry, the datastax docs are actually a bit better:
>> http://www.datastax.com/documentation/cql/3.0/cql/cql_using/paging_c.html
>>
>> Jon
>>
>>
>> On Thu, Jun 5, 2014 at 10:46 PM, Jonathan Haddad 
>> wrote:
>>
>>> You should read through the token docs; they have examples and
>>> specifications: http://cassandra.apache.org/doc/cql3/CQL.html#tokenFun
>>>
>>>
>>> On Thu, Jun 5, 2014 at 10:22 PM, Kevin Burton 
>>> wrote:
>>>
 I'm building a new schema which I need to read externally by paging
 through the result set.

 My understanding from reading the documentation, and this list, is
 that I can do that but I need to use the token() function.

 Only it doesn't work.

 Here's a reduction:


 create table test_paging (
 id int,
 primary key(id)
 );

 insert into test_paging (id) values (1);
 insert into test_paging (id) values (2);
 insert into test_paging (id) values (3);
 insert into test_paging (id) values (4);
 insert into test_paging (id) values (5);

 select * from test_paging where id > token(0);

 … but it gives me:

 Bad Request: Type error: cannot assign result of function token (type
 bigint) to id (type int)

 …

 What's that about?  I can't find any documentation for this and there
 aren't any concise examples.


 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 
 War is peace. Freedom is slavery. Ignorance is strength. Corporations
 are people.


>>>
>>>
>>> --
>>> Jon Haddad
>>> http://www.rustyrazorblade.com
>>> skype: rustyrazorblade
>>>
>>
>>
>>
>> --
>> Jon Haddad
>> http://www.rustyrazorblade.com
>> skype: rustyrazorblade
>>
>
>


-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile


War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


python fast table copy/transform (subject updated)

2014-06-06 Thread Laing, Michael
Hi Marcelo,

I have updated the prerelease app in this gist:

https://gist.github.com/michaelplaing/37d89c8f5f09ae779e47

I found that it was too easy to overrun my Cassandra clusters, so I added a
throttle arg, which defaults to 1000 rows per second.

Fixed a few bugs too, reworked the args, etc.

I'll be interested to hear if you find it useful and/or have any comments.

ml


On Thu, Jun 5, 2014 at 1:09 PM, Marcelo Elias Del Valle <
marc...@s1mbi0se.com.br> wrote:

> Michael,
>
> I will try to test it up to tomorrow and I will let you know all the
> results.
>
> Thanks a lot!
>
> Best regards,
> Marcelo.
>
>
> 2014-06-04 22:28 GMT-03:00 Laing, Michael :
>
> BTW you might want to put a LIMIT clause on your SELECT for testing. -ml
>>
>>
>> On Wed, Jun 4, 2014 at 6:04 PM, Laing, Michael > > wrote:
>>
>>> Marcelo,
>>>
>>> Here is a link to the preview of the python fast copy program:
>>>
>>> https://gist.github.com/michaelplaing/37d89c8f5f09ae779e47
>>>
>>> It will copy a table from one cluster to another with some
>>> transformation; the source and destination can be the same cluster.
>>>
>>> It has 3 main throttles to experiment with:
>>>
>>>1. fetch_size: size of source pages in rows
>>>2. worker_count: number of worker subprocesses
>>>3. concurrency: number of async callback chains per worker subprocess
>>>
>>> It is easy to overrun Cassandra and the python driver, so I recommend
>>> starting with the defaults: fetch_size: 1000; worker_count: 2; concurrency:
>>> 10.
>>>
>>> Additionally there are switches to set 'policies' by source and
>>> destination: retry (downgrade consistency), dc_aware, and token_aware.
>>> retry is useful if you are getting timeouts. For the others YMMV.
>>>
>>> To use it you need to define the SELECT and UPDATE cql statements as
>>> well as the 'map_fields' method.
>>>
>>> The worker subprocesses divide up the token range among themselves and
>>> proceed quasi-independently. Each worker opens a connection to each cluster
>>> and the driver sets up connection pools to the nodes in the cluster. Anyway
>>> there are a lot of processes, threads, callbacks going at once so it is fun
>>> to watch.
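
To illustrate the token-range split, each worker issues range queries of the
following shape (a sketch assuming Murmur3Partitioner's range of -2**63 to
2**63 - 1 and a hypothetical table src with partition key pk; the actual
bounds are computed per worker):

-- worker 0 of 2:
select * from src
where token(pk) >= -9223372036854775808 and token(pk) < 0;

-- worker 1 of 2:
select * from src
where token(pk) >= 0 and token(pk) <= 9223372036854775807;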
>>>
>>> On my regional cluster of small nodes in AWS I got about 3000 rows per
>>> second transferred after things warmed up a bit - each row about 6kb.
>>>
>>> ml
>>>
>>>
>>> On Wed, Jun 4, 2014 at 11:49 AM, Laing, Michael <
>>> michael.la...@nytimes.com> wrote:
>>>
 OK Marcelo, I'll work on it today. -ml


 On Tue, Jun 3, 2014 at 8:24 PM, Marcelo Elias Del Valle <
 marc...@s1mbi0se.com.br> wrote:

> Hi Michael,
>
> For sure I would be interested in this program!
>
> I am new both to Python and to CQL. I started creating this copier,
> but was having problems with timeouts. Alex solved my problem here on the
> list, but I think I will still have a lot of trouble making the copy
> work well.
>
> I open sourced my version here:
> https://github.com/s1mbi0se/cql_record_processor
>
> Just in case it's useful for anything.
>
> However, I saw the CQL driver has support for concurrency itself, and
> having something made by someone who knows the Python CQL driver better
> would be very helpful.
>
> My two servers today are at OVH (ovh.com); we have servers at AWS, but in
> several cases we prefer other hosts. Both servers have SSD and 64 GB RAM;
> I could use the script as a benchmark for you if you want. Besides, we
> have some bigger clusters; I could run it on those just to test the speed,
> if that is going to help.
>
> Regards
> Marcelo.
>
>
> 2014-06-03 11:40 GMT-03:00 Laing, Michael :
>
> Hi Marcelo,
>>
>> I could create a fast copy program by repurposing some python apps
>> that I am using for benchmarking the python driver - do you still need 
>> this?
>>
>> With high levels of concurrency and multiple subprocess workers,
>> based on my current actual benchmarks, I think I can get well over 1,000
>> rows/second on my mac and significantly more in AWS. I'm using variable
>> size rows averaging 5kb.
>>
>> This would be the initial version of a piece of the benchmark suite
>> we will release as part of our nyt⨍aбrik project on 21 June for my
>> Cassandra Day NYC talk re the python driver.
>>
>> ml
>>
>>
>> On Mon, Jun 2, 2014 at 2:15 PM, Marcelo Elias Del Valle <
>> marc...@s1mbi0se.com.br> wrote:
>>
>>> Hi Jens,
>>>
>>> Thanks for trying to help.
>>>
>>> Indeed, I know I can't do it using just CQL. But what would you use
>>> to migrate data manually? I tried to create a python program using auto
>>> paging, but I am getting timeouts. I also tried Hive, but no success.
>>> I only have two nodes and less than 200 GB in this cluster; any
>>> simple way to extract the data quickly would be good enough for me.
>>>
>>> Best regards,
>>> Marcelo.
>

Re: ANNOUNCEMENT: cassandra-aws project

2014-06-06 Thread Oleg Dulin

I guess I didn't know about the ComboAMI!

Thanks! I'll look into this.

I have been rolling my own AMIs for a simple reason -- we have
environments both on-premises and in AWS. I wanted them to be the same,
structurally, so I used our on-prem configurations as a starting point.


Regards,
Oleg

On 2014-06-06 15:25:44 +, Michael Shuler said:


On 06/06/2014 09:57 AM, Oleg Dulin wrote:

I'd like to announce a pet project I started:
https://github.com/olegdulin/cassandra-aws


Cool  :)

https://github.com/riptano/ComboAMI is the DataStax AMI repo.


What I would like to accomplish as an end-goal is an Amazon marketplace
AMI that makes it easy to configure a new Cassandra cluster or add new
nodes to an existing Cassandra cluster, w/o having to jump through
hoops. Ideally I'd like to do for Cassandra what RDS does for PostgreSQL
in AWS, for instance, but I am not sure if ultimately it is possible.


Is there something that ComboAMI doesn't cover for your needs or is 
there some area that could be improved upon?



To get started, I shared some notes in the wiki as well as a couple of
scripts I used to simplify things for myself. I put those scripts
together from input I received on the #cassandra IRC channel and this
mailing list and I am very grateful to the community for helping me
through this -- so this is my contribution back.

Consider this email as a solicitation for help. I am open to
discussions, and contributions, and suggestions, anything you can help
with.


Would it be less overall work to implement changes you'd like to see by 
contributing them to ComboAMI?


I fully support lots of variations of tools - whatever makes things 
easiest for people to do exactly what they need, or in languages 
they're comfortable with, etc.






Re: ANNOUNCEMENT: cassandra-aws project

2014-06-06 Thread Michael Shuler

On 06/06/2014 09:57 AM, Oleg Dulin wrote:

I'd like to announce a pet project I started:
https://github.com/olegdulin/cassandra-aws


Cool  :)

https://github.com/riptano/ComboAMI is the DataStax AMI repo.


What I would like to accomplish as an end-goal is an Amazon marketplace
AMI that makes it easy to configure a new Cassandra cluster or add new
nodes to an existing Cassandra cluster, w/o having to jump through
hoops. Ideally I'd like to do for Cassandra what RDS does for PostgreSQL
in AWS, for instance, but I am not sure if ultimately it is possible.


Is there something that ComboAMI doesn't cover for your needs or is 
there some area that could be improved upon?



To get started, I shared some notes in the wiki as well as a couple of
scripts I used to simplify things for myself. I put those scripts
together from input I received on the #cassandra IRC channel and this
mailing list and I am very grateful to the community for helping me
through this -- so this is my contribution back.

Consider this email as a solicitation for help. I am open to
discussions, and contributions, and suggestions, anything you can help
with.


Would it be less overall work to implement changes you'd like to see by 
contributing them to ComboAMI?


I fully support lots of variations of tools - whatever makes things 
easiest for people to do exactly what they need, or in languages they're 
comfortable with, etc.


--
Michael


Re: ANNOUNCEMENT: cassandra-aws project

2014-06-06 Thread Philippe Dupont
Hi,

I'm interested to know the differences between your AMI and the DataStax
one already available in the Marketplace.

Thanks,

Philippe


*Philippe Dupont*

root
Tel. +33(0)1.84.17.73.88
Mob. +33(0)6.10.14.58.26




Video Advertising Solutions




2014-06-06 16:57 GMT+02:00 Oleg Dulin :

> Colleagues:
>
> I'd like to announce a pet project I started:
> https://github.com/olegdulin/cassandra-aws
>
> What I would like to accomplish as an end-goal is an Amazon marketplace
> AMI that makes it easy to configure a new Cassandra cluster or add new
> nodes to an existing Cassandra cluster, w/o having to jump through hoops.
> Ideally I'd like to do for Cassandra what RDS does for PostgreSQL in AWS,
> for instance, but I am not sure if ultimately it is possible.
>
> To get started, I shared some notes in the wiki as well as a couple of
> scripts I used to simplify things for myself. I put those scripts together
> from input I received on the #cassandra IRC channel and this mailing list
> and I am very grateful to the community for helping me through this -- so
> this is my contribution back.
>
> Consider this email as a solicitation for help. I am open to discussions,
> and contributions, and suggestions, anything you can help with.
>
>
> Regards,
> Oleg
>
>
>


ANNOUNCEMENT: cassandra-aws project

2014-06-06 Thread Oleg Dulin

Colleagues:

I'd like to announce a pet project I started: 
https://github.com/olegdulin/cassandra-aws


What I would like to accomplish as an end-goal is an Amazon marketplace 
AMI that makes it easy to configure a new Cassandra cluster or add new 
nodes to an existing Cassandra cluster, w/o having to jump through 
hoops. Ideally I'd like to do for Cassandra what RDS does for 
PostgreSQL in AWS, for instance, but I am not sure if ultimately it is 
possible.


To get started, I shared some notes in the wiki as well as a couple of 
scripts I used to simplify things for myself. I put those scripts 
together from input I received on the #cassandra IRC channel and this 
mailing list and I am very grateful to the community for helping me
through this -- so this is my contribution back.


Consider this email as a solicitation for help. I am open to 
discussions, and contributions, and suggestions, anything you can help 
with.



Regards,
Oleg




Re: Bad Request: Type error: cannot assign result of function token (type bigint) to id (type int)

2014-06-06 Thread Jack Krupansky
The message does seem a little odd in that it refers to “assign”, but it would 
make more sense to say “compare”.

-- Jack Krupansky

From: Kevin Burton 
Sent: Friday, June 6, 2014 1:22 AM
To: user@cassandra.apache.org 
Subject: Bad Request: Type error: cannot assign result of function token (type 
bigint) to id (type int)

I'm building a new schema which I need to read externally by paging through the 
result set. 

My understanding from reading the documentation, and this list, is that I can
do that but I need to use the token() function.

Only it doesn't work.

Here's a reduction:


create table test_paging (
id int,
primary key(id)
);


insert into test_paging (id) values (1);
insert into test_paging (id) values (2);
insert into test_paging (id) values (3);
insert into test_paging (id) values (4);
insert into test_paging (id) values (5);


select * from test_paging where id > token(0);


… but it gives me:


Bad Request: Type error: cannot assign result of function token (type bigint) 
to id (type int)


… 

What's that about?  I can't find any documentation for this and there aren't 
any concise examples.


-- 


Founder/CEO Spinn3r.com

Location: San Francisco, CA
Skype: burtonator
blog: http://burtonator.wordpress.com
… or check out my Google+ profile

War is peace. Freedom is slavery. Ignorance is strength. Corporations are 
people.

Re: Bad Request: Type error: cannot assign result of function token (type bigint) to id (type int)

2014-06-06 Thread Laing, Michael
select * from test_paging where token(id) > token(0);

ml


On Fri, Jun 6, 2014 at 1:47 AM, Jonathan Haddad  wrote:

> Sorry, the datastax docs are actually a bit better:
> http://www.datastax.com/documentation/cql/3.0/cql/cql_using/paging_c.html
>
> Jon
>
>
> On Thu, Jun 5, 2014 at 10:46 PM, Jonathan Haddad 
> wrote:
>
>> You should read through the token docs; they have examples and
>> specifications: http://cassandra.apache.org/doc/cql3/CQL.html#tokenFun
>>
>>
>> On Thu, Jun 5, 2014 at 10:22 PM, Kevin Burton  wrote:
>>
>>> I'm building a new schema which I need to read externally by paging
>>> through the result set.
>>>
>>> My understanding from reading the documentation, and this list, is that
>>> I can do that but I need to use the token() function.
>>>
>>> Only it doesn't work.
>>>
>>> Here's a reduction:
>>>
>>>
>>> create table test_paging (
>>> id int,
>>> primary key(id)
>>> );
>>>
>>> insert into test_paging (id) values (1);
>>> insert into test_paging (id) values (2);
>>> insert into test_paging (id) values (3);
>>> insert into test_paging (id) values (4);
>>> insert into test_paging (id) values (5);
>>>
>>> select * from test_paging where id > token(0);
>>>
>>> … but it gives me:
>>>
>>> Bad Request: Type error: cannot assign result of function token (type
>>> bigint) to id (type int)
>>>
>>> …
>>>
>>> What's that about?  I can't find any documentation for this and there
>>> aren't any concise examples.
>>>
>>>
>>> --
>>>
>>> Founder/CEO Spinn3r.com
>>> Location: *San Francisco, CA*
>>> Skype: *burtonator*
>>> blog: http://burtonator.wordpress.com
>>> … or check out my Google+ profile
>>> 
>>> 
>>> War is peace. Freedom is slavery. Ignorance is strength. Corporations
>>> are people.
>>>
>>>
>>
>>
>> --
>> Jon Haddad
>> http://www.rustyrazorblade.com
>> skype: rustyrazorblade
>>
>
>
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> skype: rustyrazorblade
>


Re: CQLSSTableWriter memory leak

2014-06-06 Thread Xu Zhongxing
We figured out the reason for the growing memory usage. When adding rows, the
flush-to-disk check is done in SSTableSimpleUnsortedWriter.newRow(). But in
the compound-primary-key case, when the partition key is identical, no new
row is ever started, so the single huge row is kept in memory and no disk
sync() is done.
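
One way around this, sketched here as an untested illustration: giving the
writer more than one partition key lets newRow() fire and sync the buffer
periodically. For example, a hypothetical bucketed variant of the schema:

create table test.t (
    x uuid,
    bucket int,   -- e.g. a row counter divided by some chunk size
    y uuid,
    primary key ((x, bucket), y)
);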






On 2014-06-06 00:16:13, "Jack Krupansky" wrote:

How many rows (primary key values) are you writing for each partition of the 
primary key? I mean, are there relatively few, or are these very wide 
partitions?
 
Oh, I see! You’re writing 50,000,000 rows to a single partition! My, that IS 
ambitious.
 
-- Jack Krupansky
 
From: Xu Zhongxing
Sent: Thursday, June 5, 2014 3:34 AM
To: user@cassandra.apache.org
Subject: CQLSSTableWriter memory leak
 

I am using Cassandra's CQLSSTableWriter to import a large amount of data into
Cassandra. When I use CQLSSTableWriter to write to a table with a compound
primary key, the memory consumption keeps growing, and the JVM GC cannot
collect any of the used memory. When writing to tables with no compound
primary key, the JVM GC works fine.

My Cassandra version is 2.0.5. The OS is Ubuntu 14.04 x86-64. JVM parameters 
are -Xms1g -Xmx2g. This is sufficient for all other non-compound primary key 
cases.

The problem can be reproduced by the following test case:

import org.apache.cassandra.io.sstable.CQLSSTableWriter;
import org.apache.cassandra.exceptions.InvalidRequestException;

import java.io.IOException;
import java.util.UUID;

class SS {
    public static void main(String[] args) {
        String schema = "create table test.t (x uuid, y uuid, primary key (x, y))";
        String insert = "insert into test.t (x, y) values (?, ?)";
        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory("/tmp/test/t")
                .forTable(schema)
                .withBufferSizeInMB(32)
                .using(insert)
                .build();

        // One fixed partition key: every row below lands in the same partition.
        UUID id = UUID.randomUUID();
        try {
            // 50,000,000 rows, per the discussion above.
            for (int i = 0; i < 50000000; i++) {
                UUID id2 = UUID.randomUUID();  // new clustering key, same partition
                writer.addRow(id, id2);
            }
            writer.close();
        } catch (Exception e) {
            System.err.println("hell");
        }
    }
}
}