Re: HDFS vs Accumulo Performance

Keith Turner Mon, 05 Dec 2016 09:22:20 -0800

There is also the setBatchSize method [1] on the Scanner. I think
it defaults to 1000. You could try adjusting that.


[1]: http://accumulo.apache.org/1.8/apidocs/org/apache/accumulo/core/client/Scanner.html#setBatchSize%28int%29
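
As a rough sanity check on why the batch size might matter here, below is a
back-of-the-envelope sketch in plain Java. It is not Accumulo code and not a
measurement. The class and method names are made up for illustration, and it
assumes that each batch returned to the client costs roughly one seek on the
server, which nothing in this thread confirms. It also assumes the scan would
still take 20 seconds at other batch sizes, which is almost certainly too
simple.

```java
// Back-of-the-envelope estimate only; hypothetical helper, not part of Accumulo.
class ScanBatchEstimate {

    // Number of batches needed to return `entries` results when each batch
    // holds at most `batchSize` entries (ceiling division).
    public static long batches(long entries, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        return (entries + batchSize - 1) / batchSize;
    }

    // Batches per second if the whole scan takes `seconds`.
    public static double batchesPerSecond(long entries, int batchSize, double seconds) {
        return batches(entries, batchSize) / seconds;
    }

    public static void main(String[] args) {
        long entries = 10_000_000L; // entry count reported in this thread
        double seconds = 20.0;      // scan time reported in this thread
        // Assumption: scan time held constant across batch sizes (simplistic).
        for (int batchSize : new int[] {1_000, 10_000, 100_000}) {
            System.out.printf("batchSize=%d -> %d batches, %.1f batches/s%n",
                    batchSize,
                    batches(entries, batchSize),
                    batchesPerSecond(entries, batchSize, seconds));
        }
    }
}
```

With 10M entries, a batch size of 1000 and a 20 second scan, this works out to
10,000 batches, or about 500 per second, which is in the same range as the
~500 seeks/s reported. That may be a coincidence, so treat it as a hint for
what to test rather than an explanation.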

On Mon, Dec 5, 2016 at 12:10 PM, Mario Pastorelli
<[email protected]> wrote:
> table.scan.max.memory doesn't affect the number of seeks in our case. We
> tried with 1MB and 2MB.
>
> On Mon, Dec 5, 2016 at 5:59 PM, Josh Elser <[email protected]> wrote:
>>
>> If you're only ever doing sequential scans, IMO, it's expected that HDFS
>> would be faster. Remember that, architecturally, Accumulo is designed for
>> *random-read/write* workloads. This is where it would shine in comparison to
>> HDFS. Accumulo is always going to take a hit in sequential read/write
>> workloads compared to HDFS.
>>
>> As to your question about the number of seeks, try playing with the value
>> of "table.scan.max.memory" [1]. You should be able to easily twiddle the
>> value in the Accumulo shell and re-run the test. Accumulo tears down these
>> active scans because it expects that your client would be taking time to
>> process the results it just sent and it would want to not hold onto those in
>> memory (as your client may not come back). Increasing that property will
>> increase the amount of data sent in one RPC which in turn will reduce the
>> number of RPCs and seeks. Aside: I think this server-side "scanner" lifetime
>> is something that we'd want to revisit sooner rather than later.
>>
>> 25MB/s seems like a pretty reasonable read rate for one TabletServer
>> (since you only have one tablet). That is also why a BatchScanner made
>> no difference: BatchScanners parallelize access to multiple Tablets and
>> would add nothing but overhead when you read from a single Tablet.
>>
>> [1]
>> http://accumulo.apache.org/1.7/accumulo_user_manual.html#_table_scan_max_memory
>>
>> Mario Pastorelli wrote:
>>>
>>> We are trying to understand Accumulo performance to better plan our
>>> future products that use it and we noticed that the read speed of
>>> Accumulo tends to be way lower than what we would expect. We have a
>>> testing cluster with 4 HDFS+Accumulo nodes and we ran some tests. We
>>> wrote two programs to write to HDFS and Accumulo and two programs to
>>> read/scan from HDFS and Accumulo the same number of records containing
>>> random bytes. We ran all the programs from outside the cluster, on
>>> another node of the rack that has neither HDFS nor Accumulo.
>>>
>>> We also wrote all the HDFS blocks and Accumulo tablets to the same
>>> machine of the cluster.
>>>
>>>
>>> First of all, we wrote 10M entries to HDFS, where each entry was 50
>>> bytes. This resulted in 4 blocks on HDFS. Reading these records with a
>>> FSDataInputStream takes around 5.7 seconds, with an average speed of
>>> around 90MB per second.
>>>
>>>
>>> Then we wrote 10M entries to Accumulo, where each entry has a row of 50
>>> random bytes, no column and no value. Writing is as fast as writing to
>>> HDFS, modulo the compaction that we run at the end. The generated table
>>> has 1 tablet and obviously 10M records, all on the same cluster. We
>>> waited for the compaction to finish, then we opened a scanner without
>>> setting the range and we read all the records. This time, reading the
>>> data took around 20 seconds, with an average speed of 25MB/s and 500000
>>> records/s, together with ~500 seeks/s. We have three questions about
>>> this result:
>>>
>>>
>>> 1 - Is this kind of performance expected?
>>>
>>> 2 - Is there any configuration that we can change to improve the scan
>>> speed?
>>>
>>> 3 - Why are there 500 seeks if there is only one tablet and we read
>>> all its bytes sequentially? What are those seeks doing?
>>>
>>>
>>> We tried to use a BatchScanner with 1, 5 and 10 threads but the speed
>>> was the same or even worse in some cases.
>>>
>>>
>>> I can provide the code that we used as well as information about our
>>> cluster configuration if you want.
>>>
>>> Thanks,
>>>
>>> Mario
>>>
>>> --
>>> Mario Pastorelli| TERALYTICS
>>>
>>> *software engineer*
>>>
>>> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>>> phone:+41794381682
>>> email: [email protected]
>>> <mailto:[email protected]>
>>> www.teralytics.net <http://www.teralytics.net/>
>>>
>>> Company registration number: CH-020.3.037.709-7 | Trade register Canton
>>> Zurich
>>> Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz,
>>> Yann de Vries
>>>
>>> This e-mail message contains confidential information which is for the
>>> sole attention and use of the intended recipient. Please notify us at
>>> once if you think that it may not be intended for you and delete it
>>> immediately.
>>>
