The performance difference of online bulk insertion and the file-based bulk loading

José Elias Queiroga da Costa Araújo Tue, 15 Oct 2013 05:44:01 -0700

Hi all.

We are using the Cassandra 1.2 StorageServiceMBean class (via the JMX bulk
loader) to load a DB image into the Cassandra cluster. After loading the DB
image, we issue a bulk retrieval to read the data back using the Hector
API’s multigetSliceQuery. Let’s call this first method file-based bulk
loading.


Alternatively, we use the Hector API to do an online batch insertion
(Mutator.addInsertion). Let’s call this second method
online-bulk-insertion. After the online-bulk-insertion, we issue the same
bulk retrieval to read the data back using the Hector API’s
multigetSliceQuery.

We find that the retrieval performance is about 10 times better after the
online-bulk-insertion than after the file-based bulk loading.

Certainly, the explanation is that after the online-bulk-insertion the
inserted data is in the memtable, which speeds up the subsequent bulk
retrieval.

My questions are:

- Is there a way to warm up the cache after the file-based bulk loading, so
that the data is cached in memory first and the subsequent bulk retrieval
performs closer to what we see after the online-bulk-insertion?
- Will the sstableloader provided in Cassandra’s bin directory perform
differently from the JMX bulk loader?
- Do I need to wait for some time after the JmxBulkLoader or sstableloader
load before I issue the bulk retrieval call, because the Cassandra cluster
is doing some housekeeping, such as index building, for the newly
bulk-loaded data?


I did try bin/nodetool refresh after the file-based bulk loading, but I did
not see any effect.
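
To make the warm-up idea from the first question concrete, here is a rough
sketch of what I have in mind: split the row keys into batches and issue one
throwaway read per batch before the timed retrieval. The class name
WarmUpSketch, the batch size, and the reader callback are all hypothetical;
in our setup the callback would wrap a Hector multigetSliceQuery and discard
the result. Whether such reads actually populate the caches that matter is
exactly what I am unsure about.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class WarmUpSketch {

    // Split the full key list into consecutive batches of at most batchSize keys.
    public static List<List<String>> partition(List<String> keys, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += batchSize) {
            int end = Math.min(i + batchSize, keys.size());
            batches.add(new ArrayList<>(keys.subList(i, end)));
        }
        return batches;
    }

    // Issue one throwaway read per batch and return the number of batches read.
    // 'reader' is a hypothetical callback (assumption): in practice it would
    // run a multiget for the batch and ignore the returned rows.
    public static int warmUp(List<String> keys, int batchSize,
                             Consumer<List<String>> reader) {
        int count = 0;
        for (List<String> batch : partition(keys, batchSize)) {
            reader.accept(batch);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in keys and a stand-in reader that only logs each batch.
        List<String> keys = Arrays.asList("k1", "k2", "k3", "k4", "k5");
        int batches = warmUp(keys, 2,
                batch -> System.out.println("warm-up read for " + batch));
        System.out.println(batches + " warm-up batches issued");
    }
}
```

The batching is only there to keep each warm-up request a bounded size; the
real read call would replace the logging lambda in main.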

Thanks in advance.

Elias.
