There is also the setBatchSize method [1] on the Scanner. I think it defaults to 1000 entries. You could try adjusting that.
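
As a rough sanity check on the numbers in this thread (a sketch, not a measurement): if each batch returned to the client costs about one seek on the server, which is purely my assumption, then 10M entries at 1000 entries per batch is 10,000 batches, and over a 20 second scan that works out to about 500 per second, close to the ~500 seeks/s reported. The class and method names below are made up for illustration; the only real API involved is Scanner.setBatchSize(int).

```java
/** Back-of-the-envelope model, ASSUMING one server-side seek per batch sent to the client. */
final class ScanBatchEstimate {

    /** Batches (rounded up) needed to return totalEntries at batchSize entries per batch. */
    static long expectedBatches(long totalEntries, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        return (totalEntries + batchSize - 1) / batchSize;
    }

    /** Batches per second for a scan that takes scanSeconds in total. */
    static double batchesPerSecond(long totalEntries, int batchSize, double scanSeconds) {
        if (scanSeconds <= 0) {
            throw new IllegalArgumentException("scanSeconds must be positive");
        }
        return expectedBatches(totalEntries, batchSize) / scanSeconds;
    }

    public static void main(String[] args) {
        long entries = 10_000_000L; // entries in the test table from this thread
        double seconds = 20.0;      // observed scan time from this thread
        // 1000 is what I believe the default is; the larger values are hypothetical settings.
        for (int batchSize : new int[] {1_000, 10_000, 100_000}) {
            System.out.printf("batchSize=%d -> %d batches, ~%.0f per second%n",
                    batchSize,
                    expectedBatches(entries, batchSize),
                    batchesPerSecond(entries, batchSize, seconds));
        }
    }
}
```

If that model holds, calling scanner.setBatchSize(10000) on your Scanner should reduce the number of batches by roughly 10x, although table.scan.max.memory may cut each batch short before the entry count is reached.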

[1]: http://accumulo.apache.org/1.8/apidocs/org/apache/accumulo/core/client/Scanner.html#setBatchSize%28int%29

On Mon, Dec 5, 2016 at 12:10 PM, Mario Pastorelli <[email protected]> wrote:
> table.scan.max.memory doesn't affect the number of seeks in our case. We
> tried with 1MB and 2MB.
>
> On Mon, Dec 5, 2016 at 5:59 PM, Josh Elser <[email protected]> wrote:
>>
>> If you're only ever doing sequential scans, IMO, it's expected that HDFS
>> would be faster. Remember that, architecturally, Accumulo is designed for
>> *random-read/write* workloads. This is where it would shine in comparison
>> to HDFS. Accumulo is always going to take a hit in sequential read/write
>> workloads compared to HDFS.
>>
>> As to your question about the number of seeks, try playing with the value
>> of "table.scan.max.memory" [1]. You should be able to easily twiddle the
>> value in the Accumulo shell and re-run the test. Accumulo tears down these
>> active scans because it expects that your client will take time to process
>> the results it just sent, and it does not want to hold onto those in
>> memory (as your client may not come back). Increasing that property will
>> increase the amount of data sent in one RPC, which in turn will reduce the
>> number of RPCs and seeks. Aside: I think this server-side "scanner"
>> lifetime is something that we'd want to revisit sooner rather than later.
>>
>> 25MB/s seems like a pretty reasonable read rate for one TabletServer
>> (since you only have one tablet). That is also why a BatchScanner made no
>> difference. BatchScanners parallelize access to multiple Tablets and
>> would add nothing but overhead when you read from a single Tablet.
>>
>> [1]
>> http://accumulo.apache.org/1.7/accumulo_user_manual.html#_table_scan_max_memory
>>
>> Mario Pastorelli wrote:
>>>
>>> We are trying to understand Accumulo performance to better plan our
>>> future products that use it, and we noticed that the read speed of
>>> Accumulo tends to be way lower than what we would expect. We have a
>>> testing cluster with 4 HDFS+Accumulo nodes and we ran some tests. We
>>> wrote two programs to write to HDFS and Accumulo and two programs to
>>> read/scan from HDFS and Accumulo the same number of records containing
>>> random bytes. We ran all the programs from outside the cluster, on
>>> another node of the rack that has neither HDFS nor Accumulo.
>>>
>>> We also wrote all the HDFS blocks and Accumulo tablets on the same
>>> machine of the cluster.
>>>
>>>
>>> First of all, we wrote 10M entries to HDFS where each entry was 50
>>> bytes. This resulted in 4 blocks on HDFS. Reading these records with a
>>> FSDataInputStream takes around 5.7 seconds with an average speed of
>>> around 90MB per second.
>>>
>>>
>>> Then we wrote 10M entries to Accumulo where each entry has a row of 50
>>> random bytes, no column and no value. Writing is as fast as writing to
>>> HDFS modulo the compaction that we run at the end. The generated table
>>> has 1 tablet and obviously 10M records all on the same cluster. We
>>> waited for the compaction to finish, then we opened a scanner without
>>> setting the range and we read all the records. This time, reading the
>>> data took around 20 seconds with an average speed of 25MB/s and 500000
>>> records/s together with ~500 seeks/s. We have three questions about
>>> this result:
>>>
>>>
>>> 1 - Is this kind of performance expected?
>>>
>>> 2 - Is there any configuration that we can change to improve the scan
>>> speed?
>>>
>>> 3 - Why are there 500 seeks if there is only one tablet and we read
>>> sequentially all its bytes? What are those seeks doing?
>>>
>>>
>>> We tried to use a BatchScanner with 1, 5 and 10 threads but the speed
>>> was the same or even worse in some cases.
>>>
>>>
>>> I can provide the code that we used as well as information about our
>>> cluster configuration if you want.
>>>
>>> Thanks,
>>>
>>> Mario
>>>
>>> --
>>> Mario Pastorelli | TERALYTICS
>>>
>>> *software engineer*
>>>
>>> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
>>> phone: +41794381682
>>> email: [email protected]
>>> www.teralytics.net
>>>
>>> Company registration number: CH-020.3.037.709-7 | Trade register Canton
>>> Zurich
>>> Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz,
>>> Yann de Vries
>>>
>>> This e-mail message contains confidential information which is for the
>>> sole attention and use of the intended recipient. Please notify us at
>>> once if you think that it may not be intended for you and delete it
>>> immediately.
>>>
>
> --
> Mario Pastorelli | TERALYTICS
