Hi, I'm running a single-threaded ingestion program that reads data from an
input source, parses it into mutations, and then writes those mutations
sequentially to four different BatchWriters (each on a different table). About
95% of the run time is spent adding mutations, e.g.
batchWriter.addMutations(mutations), and I am wondering how to reduce the time
these calls take.
1) For the method batchWriter.addMutations(Iterable<Mutation>), does it matter
for performance whether the mutations returned by the iterator are sorted in
lexicographic order?
2) If the Iterable<Mutation> that I pass to the BatchWriter is very large, does
addMutations block until a number of batches have been written and flushed
before it finishes iterating, or does it first copy the elements of the
Iterable into a separate intermediate list?
3) If addMutations does block like that, would it make sense to spawn a
short-lived thread for each call to addMutations?
At a high level, my code looks like this:
    BatchWriter bw1 = connector.createBatchWriter(...)
    BatchWriter bw2 = ...
    ...
    while(true) {
        String[] data = input.getData();
        List<Mutation> mutations1 = parseData1(data);
        List<Mutation> mutations2 = parseData2(data);
        ...
        bw1.addMutations(mutations1);
        bw2.addMutations(mutations2);
        ...
    }
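For question 3, here is a minimal sketch of what I have in mind. It is not
Accumulo code: Sink is a hypothetical stand-in for BatchWriter and String
stands in for Mutation so that the example runs on its own, and I have assumed
a small fixed thread pool rather than a new thread per call. I have not
verified that this actually helps with real BatchWriters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelAddSketch {

    /** Hypothetical stand-in for BatchWriter.addMutations(Iterable<Mutation>). */
    interface Sink<M> {
        void addMutations(Iterable<M> mutations) throws Exception;
    }

    /**
     * Submits one addMutations call per sink to the pool, then waits for all
     * of them. sinks.get(i) receives batches.get(i).
     */
    static <M> void addAll(ExecutorService pool,
                           List<Sink<M>> sinks,
                           List<List<M>> batches) throws Exception {
        List<Future<?>> pending = new ArrayList<>();
        for (int i = 0; i < sinks.size(); i++) {
            final Sink<M> sink = sinks.get(i);
            final List<M> batch = batches.get(i);
            // The lambda returns a value, so it is submitted as a Callable
            // and may throw checked exceptions.
            pending.add(pool.submit(() -> {
                sink.addMutations(batch);
                return null;
            }));
        }
        // get() blocks until the task is done and rethrows any failure
        // wrapped in an ExecutionException.
        for (Future<?> f : pending) {
            f.get();
        }
    }

    public static void main(String[] args) throws Exception {
        // Two in-memory "tables" in place of real BatchWriters.
        List<String> table1 = Collections.synchronizedList(new ArrayList<>());
        List<String> table2 = Collections.synchronizedList(new ArrayList<>());
        Sink<String> bw1 = ms -> ms.forEach(table1::add);
        Sink<String> bw2 = ms -> ms.forEach(table2::add);

        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            addAll(pool,
                   Arrays.asList(bw1, bw2),
                   Arrays.asList(Arrays.asList("row1", "row2"),
                                 Arrays.asList("row3")));
        } finally {
            pool.shutdown();
        }
        System.out.println(table1.size() + " " + table2.size()); // prints "2 1"
    }
}
```

In my real loop the pool would be created once before while(true) and addAll
would be called once per iteration. Because addAll waits for every task before
returning, each writer is only used by one thread at a time.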
Thanks,
David