Re: HDFS vs Accumulo Performance

Josh Elser Mon, 05 Dec 2016 09:01:20 -0800

If you're only ever doing sequential scans, IMO, it's expected that HDFS would be faster. Remember that, architecturally, Accumulo is designed for *random-read/write* workloads. This is where it would shine in comparison to HDFS. Accumulo is always going to take a hit in sequential read/write workloads compared to HDFS.

As to your question about the number of seeks, try playing with the value of "table.scan.max.memory" [1]. You should be able to easily twiddle the value in the Accumulo shell and re-run the test. Accumulo tears down these active scans because it expects that your client will take time to process the results it was just sent, and it does not want to hold onto those results in memory (as your client may not come back). Increasing that property will increase the amount of data sent in one RPC, which in turn will reduce the number of RPCs and seeks. Aside: I think this server-side "scanner" lifetime is something that we'd want to revisit sooner rather than later.
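As a rough illustration of the relationship described above, here is a back-of-the-envelope model in Python (not Accumulo code). It assumes every RPC returns exactly one full buffer and ignores per-entry key overhead, so the absolute counts are illustrative only. The function name and the buffer sizes are made up for this sketch.

```python
import math


def rpcs_for_scan(total_bytes, buffer_bytes):
    """Round trips needed if every RPC returns one full buffer of results."""
    if buffer_bytes <= 0:
        raise ValueError("buffer_bytes must be positive")
    return math.ceil(total_bytes / buffer_bytes)


# 10M entries of 50 bytes each, as in the test described in this thread.
TOTAL_BYTES = 10_000_000 * 50

# Hypothetical buffer sizes, chosen only to show the trend.
for label, size in [("512K", 512 * 1024), ("4M", 4 * 1024**2), ("32M", 32 * 1024**2)]:
    rpcs = rpcs_for_scan(TOTAL_BYTES, size)
    print(f"table.scan.max.memory={label}: about {rpcs} RPCs")
```

The point is only that the RPC count falls roughly in inverse proportion to the buffer size. In the Accumulo shell, the change itself would look something like `config -t mytable -s table.scan.max.memory=4M`, where `mytable` is a placeholder table name.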

25MB/s seems like a pretty reasonable read rate for one TabletServer (since you only have one tablet). Similarly, that's why a BatchScanner would have made no difference. BatchScanners parallelize access to multiple Tablets and would add nothing but overhead when you read from a single Tablet.

[1] http://accumulo.apache.org/1.7/accumulo_user_manual.html#_table_scan_max_memory


Mario Pastorelli wrote:

We are trying to understand Accumulo performance to better plan our
future products that use it and we noticed that the read speed of
Accumulo tends to be way lower than what we would expect. We have a
testing cluster with 4 HDFS+Accumulo nodes and we ran some tests. We
wrote two programs to write to HDFS and Accumulo and two programs to
read/scan from HDFS and Accumulo the same number of records containing
random bytes. We ran all the programs from outside the cluster, on
another node of the rack that has neither HDFS nor Accumulo.

We also wrote all the HDFS blocks and Accumulo tablets on the same
machine of the cluster.


First of all, we wrote 10M entries to HDFS where each entry was 50
bytes. This resulted in 4 blocks on HDFS. Reading these records with a
FSDataInputStream takes around 5.7 seconds with an average speed of
around 90MB per second.


Then we wrote 10M entries to Accumulo where each entry has a row of 50
random bytes, no column and no value. Writing is as fast as writing to
HDFS modulo the compaction that we run at the end. The generated table
has 1 tablet and obviously 10M records all on the same cluster. We
waited for the compaction to finish, then we opened a scanner without
setting the range and we read all the records. This time, reading the
data took around 20 seconds with average speed of 25MB/s and 500000
records/s together with ~500 seeks/s. We have three questions about this
result:


1 - Is this kind of performance expected?

2 - Is there any configuration that we can change to improve the scan speed?

3 - Why are there 500 seeks if there is only one tablet and we read
sequentially all its bytes? What are those seeks doing?
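For reference, the reported figures can be cross-checked with a few lines of arithmetic. This is only a sketch: it counts just the 50-byte payload per entry (no Accumulo key overhead), uses decimal megabytes, and the helper name is made up for illustration.

```python
ENTRIES = 10_000_000
ENTRY_BYTES = 50
MB = 1_000_000  # decimal megabytes


def throughput(seconds):
    """Return (MB/s, records/s) for reading the whole data set in `seconds`."""
    total_bytes = ENTRIES * ENTRY_BYTES
    return total_bytes / MB / seconds, ENTRIES / seconds


hdfs_mb_s, hdfs_rec_s = throughput(5.7)   # HDFS read time reported above
acc_mb_s, acc_rec_s = throughput(20.0)    # Accumulo scan time reported above

# About 500 seeks/s were reported during the Accumulo scan.
records_per_seek = acc_rec_s / 500

print(f"HDFS:     {hdfs_mb_s:.1f} MB/s")
print(f"Accumulo: {acc_mb_s:.1f} MB/s, {acc_rec_s:.0f} records/s")
print(f"Records read between seeks: {records_per_seek:.0f}")
```

The numbers are consistent with each other: roughly 88 MB/s for HDFS against 25 MB/s for Accumulo, and about one seek for every 1000 records scanned.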


We tried to use a BatchScanner with 1, 5 and 10 threads but the speed
was the same or even worse in some cases.


I can provide the code that we used as well as information about our
cluster configuration if you want.

Thanks,

Mario

--
Mario Pastorelli| TERALYTICS

*software engineer*

Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
phone:+41794381682
email: [email protected]
<mailto:[email protected]>
www.teralytics.net <http://www.teralytics.net/>

Company registration number: CH-020.3.037.709-7 | Trade register Canton
Zurich
Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz,
Yann de Vries

This e-mail message contains confidential information which is for the
sole attention and use of the intended recipient. Please notify us at
once if you think that it may not be intended for you and delete it
immediately.
