I don't have enough context to say definitively, but I'd assume earlier versions too.

Dan Blum wrote:
Is this a problem specific to 1.8.0, or is it likely to affect earlier versions?

-----Original Message-----
From: Josh Elser [mailto:[email protected]]
Sent: Saturday, September 10, 2016 6:01 PM
To: [email protected]
Subject: Re: Accumulo Seek performance

Sven, et al:

So, it would appear that I have been able to reproduce this one (better
late than never, I guess...). tl;dr Serially using Scanners to do point
lookups instead of a BatchScanner is ~20x faster. This sounds like a
pretty serious performance issue to me.

Here's a general outline for what I did.

* Accumulo 1.8.0
* Created a table with 1M rows, each row with 10 columns using YCSB
(workloada)
* Split the table into 9 tablets
* Computed the set of all rows in the table

For a number of iterations:
* Shuffle this set of rows
* Choose the first N rows
* Construct an equivalent set of Ranges from the set of Rows, choosing a
random column (0-9)
* Partition the N rows into X collections
* Submit X tasks to query one partition of the N rows (to a thread pool
with X fixed threads)
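
The per-iteration steps above can be sketched in plain Java. This is only an illustration of the shuffle, select, partition and submit flow, not the code from the linked repository: the class and method names are hypothetical, the "fieldN" column naming is an assumption based on YCSB defaults, and the task body is a placeholder where the actual Accumulo lookups would go.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class RangePartitionSketch {
    // Shuffle a copy of all rows and keep the first n.
    static List<String> chooseRows(List<String> allRows, int n, Random rnd) {
        List<String> shuffled = new ArrayList<>(allRows);
        Collections.shuffle(shuffled, rnd);
        return new ArrayList<>(shuffled.subList(0, Math.min(n, shuffled.size())));
    }

    // Pair each row with a random column 0-9 (assumed YCSB-style "fieldN" names).
    static List<String[]> toLookups(List<String> rows, Random rnd) {
        List<String[]> lookups = new ArrayList<>();
        for (String row : rows) {
            lookups.add(new String[] {row, "field" + rnd.nextInt(10)});
        }
        return lookups;
    }

    // Split the items round-robin into x collections.
    static <T> List<List<T>> partition(List<T> items, int x) {
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < x; i++) {
            parts.add(new ArrayList<>());
        }
        for (int i = 0; i < items.size(); i++) {
            parts.get(i % x).add(items.get(i));
        }
        return parts;
    }

    // Submit one task per partition to a fixed pool with one thread per partition.
    // The task is a placeholder that only counts its lookups; a real task would
    // run the queries for its partition. Returns the total across all tasks.
    static int runPartitions(List<List<String[]>> parts) {
        ExecutorService pool = Executors.newFixedThreadPool(parts.size());
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (List<String[]> part : parts) {
                Callable<Integer> task = () -> part.size();
                futures.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> f : futures) {
                total += f.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> allRows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            allRows.add("user" + i);
        }
        Random rnd = new Random();
        List<String[]> lookups = toLookups(chooseRows(allRows, 300, rnd), rnd);
        List<List<String[]>> parts = partition(lookups, 6);
        long start = System.currentTimeMillis();
        int done = runPartitions(parts);
        System.out.println(done + " lookups in " + (System.currentTimeMillis() - start) + " ms");
    }
}
```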

I have two implementations of these tasks. One, where all ranges in a
partition are executed via one BatchScanner. A second where each range is
executed in serial using a Scanner. The numbers speak for themselves.

** BatchScanners **
2016-09-10 17:51:38,811 [joshelser.YcsbBatchScanner] INFO : Shuffled all
rows
2016-09-10 17:51:38,843 [joshelser.YcsbBatchScanner] INFO : All ranges
calculated: 3000 ranges found
2016-09-10 17:51:38,846 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 40178 ms
2016-09-10 17:52:19,025 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 42296 ms
2016-09-10 17:53:01,321 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:53:47,414 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 46094 ms
2016-09-10 17:53:47,415 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:54:35,118 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 47704 ms
2016-09-10 17:54:35,119 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:55:24,339 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 49221 ms

** Scanners **
2016-09-10 17:57:23,867 [joshelser.YcsbBatchScanner] INFO : Shuffled all
rows
2016-09-10 17:57:23,898 [joshelser.YcsbBatchScanner] INFO : All ranges
calculated: 3000 ranges found
2016-09-10 17:57:23,903 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 2833 ms
2016-09-10 17:57:26,738 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 2536 ms
2016-09-10 17:57:29,275 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 2150 ms
2016-09-10 17:57:31,425 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 2061 ms
2016-09-10 17:57:33,487 [joshelser.YcsbBatchScanner] INFO : Executing 6
range partitions using a pool of 6 threads
2016-09-10 17:57:35,628 [joshelser.YcsbBatchScanner] INFO : Queries
executed in 2140 ms
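
As a quick check of the ~20x figure, the "Queries executed in" timings from the two logs above can be averaged and compared. This is just arithmetic over the numbers already shown; the class and method names are hypothetical.

```java
import java.util.Arrays;

public class SpeedupCheck {
    // Arithmetic mean of a set of timings in milliseconds.
    static double mean(long[] ms) {
        return Arrays.stream(ms).average().orElse(Double.NaN);
    }

    // Ratio of the mean slow timing to the mean fast timing.
    static double speedup(long[] slow, long[] fast) {
        return mean(slow) / mean(fast);
    }

    public static void main(String[] args) {
        // Timings copied from the log output above.
        long[] batchScannerMs = {40178, 42296, 46094, 47704, 49221};
        long[] scannerMs = {2833, 2536, 2150, 2061, 2140};
        System.out.printf("BatchScanner mean: %.0f ms%n", mean(batchScannerMs));
        System.out.printf("Scanner mean: %.0f ms%n", mean(scannerMs));
        System.out.printf("Speedup: %.1fx%n", speedup(batchScannerMs, scannerMs));
    }
}
```

Over these five iterations each, the ratio of the means comes out at roughly 19x.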

Query code is available at https://github.com/joshelser/accumulo-range-binning

Sven Hodapp wrote:
Hi Keith,

I've tried it with 1, 2 or 10 threads. Unfortunately there were no notable
differences.
Maybe it's a problem with the table structure? For example it may happen that 
one row id (e.g. a sentence) has several thousand column families. Can this 
affect the seek performance?

So for my initial example it has about 3000 row ids to seek, which will return 
about 500k entries. If I filter for specific column families (e.g. a document 
without annotations) it will return about 5k entries, but the seek time will 
only be halved.
Are there too many column families to seek quickly?

Thanks!

Regards,
Sven

