>
> If you can do a merge sort insertion, then you can guarantee order and
> it's fine.
>
Yep, I guarantee the iterator we add as a side channel will emit tuples in
sorted order.
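
To illustrate why sorted order matters here: the top-level sources are merged by repeatedly taking the least current key, so the merged output is sorted only if every source, including the side channel, is already sorted. Below is a minimal sketch of that kind of k-way merge. It assumes plain Java strings as keys rather than Accumulo Key/Value pairs, and the names SideChannelMergeSketch and mergeSorted are made up for illustration; it is not Accumulo's MultiIterator code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class SideChannelMergeSketch {

    /** One sorted source plus its current head element (hypothetical helper). */
    private static final class Source implements Comparable<Source> {
        final Iterator<String> it;
        String head;

        Source(Iterator<String> it) {
            this.it = it;
            this.head = it.next(); // caller only passes non-empty sources
        }

        @Override
        public int compareTo(Source other) {
            return head.compareTo(other.head);
        }
    }

    /** Merges already-sorted sources; the output is sorted only if every input is. */
    public static List<String> mergeSorted(List<List<String>> sources) {
        PriorityQueue<Source> heap = new PriorityQueue<>();
        for (List<String> s : sources) {
            if (!s.isEmpty()) {
                heap.add(new Source(s.iterator()));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Source least = heap.poll();      // source holding the least current key
            out.add(least.head);
            if (least.it.hasNext()) {
                least.head = least.it.next();
                heap.add(least);             // re-insert with its new head
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rfile = Arrays.asList("row1", "row4");               // stand-in for an RFile source
        List<String> sideChannel = Arrays.asList("row2", "row3", "row5"); // stand-in for the side channel
        System.out.println(mergeSorted(Arrays.asList(rfile, sideChannel)));
        // prints [row1, row2, row3, row4, row5]
    }
}
```

If the side channel list were not sorted, the merge would still run but could emit keys out of order, which is the failure mode the quoted concern is about.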

On a suggestion from David Medinets, I modified my testing code to use a
MiniAccumuloCluster set to 2 tablet servers. I then set a table split on
"row3" before launching the compaction. The result looks good. Here is the
output from a run on a local Accumulo instance. Note that each compaction
writes more entries than it reads (2 read and 4 written for one tablet, 1
read and 4 written for the other), as expected when the side channel
injects new entries.

2015-02-16 02:44:51,125 [tserver.Tablet] DEBUG: Starting MajC k;row3<
(USER) [hdfs://localhost:9000/accumulo/tables/k/t-00000g4/F00000g5.rf] -->
hdfs://localhost:9000/accumulo/tables/k/t-00000g4/A00000g7.rf_tmp
[name:InjectIterator, priority:15,
class:edu.mit.ll.graphulo.InjectIterator, properties:{}]
2015-02-16 02:44:51,127 [tserver.Tablet] DEBUG: Starting MajC k<;row3
(USER) [hdfs://localhost:9000/accumulo/tables/k/default_tablet/F00000g6.rf]
--> hdfs://localhost:9000/accumulo/tables/k/default_tablet/A00000g8.rf_tmp
[name:InjectIterator, priority:15,
class:edu.mit.ll.graphulo.InjectIterator, properties:{}]
2015-02-16 02:44:51,190 [tserver.Compactor] DEBUG: *Compaction k<;row3 2
read | 4 written* | 111 entries/sec | 0.018 secs
2015-02-16 02:44:51,194 [tserver.Compactor] DEBUG: *Compaction k;row3< 1
read | 4 written* | 43 entries/sec | 0.023 secs

In addition, output from the DebugIterator looks as expected. After the
first tablet is read, there is a re-seek to the key following the last
entry returned in that tablet.

DEBUG:
init(org.apache.accumulo.core.iterators.system.SynchronizedIterator@15085e63,
{}, org.apache.accumulo.tserver.TabletIteratorEnvironment@586cc05e)
DEBUG: 0x1C2BFB13 seek((-inf,+inf), [], false)
... <snipped logs>
DEBUG:
init(org.apache.accumulo.core.iterators.system.SynchronizedIterator@2b048c59,
{}, org.apache.accumulo.tserver.TabletIteratorEnvironment@379a3d1f)
DEBUG: 0x5946E74B seek([row2 colF3:colQ3 [] 9223372036854775807
false,+inf), [], false)
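
As a rough illustration of what "re-seek to the key after the last entry returned" means for a sorted source, here is a minimal sketch. It assumes string keys held in a TreeSet instead of Accumulo Keys, and it expresses "after" with an exclusive lower bound, which simplifies the Range shown in the log. The names ReseekSketch and reseekAfter are hypothetical.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

public class ReseekSketch {

    /** Returns the entries strictly after lastReturned, still in sorted order. */
    public static NavigableSet<String> reseekAfter(NavigableSet<String> source,
                                                   String lastReturned) {
        // tailSet(key, false) excludes the key itself, so nothing is emitted twice
        return source.tailSet(lastReturned, false);
    }

    public static void main(String[] args) {
        NavigableSet<String> source =
            new TreeSet<>(Arrays.asList("row1", "row2", "row3", "row4"));
        // Pretend "row2" was the last entry returned before the re-seek.
        System.out.println(reseekAfter(source, "row2"));
        // prints [row3, row4]
    }
}
```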

It seems the side channel strategy will hold up. We have opened a new
world of Accumulo-foo. Of course, the real test is a multi-node instance
with more than 10 entries of data.

Regards,
Dylan

On Sun, Feb 15, 2015 at 11:17 PM, Andrew Wells <[email protected]>
wrote:
> The main issue with adding data in an iterator is order. If you can do a
> merge sort insertion, then you can guarantee order and it's fine. But if
> you are inserting based on input, you cannot guarantee order, and it can
> only be done in a scan iterator.
> On Feb 15, 2015 8:03 PM, "Dylan Hutchison" <[email protected]> wrote:
>
>> Hello all,
>>
>> I've been toying with the registerSideChannel(iter)
>> <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/iterators/IteratorEnvironment.html#registerSideChannel(org.apache.accumulo.core.iterators.SortedKeyValueIterator)>
>> method on the IteratorEnvironment passed to iterators through the init() method.
>> From what I can tell, the method allows you to add another iterator as a
>> top level source, to be merged in along with other usual top-level sources
>> such as the in-memory cache and RFiles.
>>
>> Are there any downsides to using registerSideChannel( ) to "add new data"
>> to an iterator chain? It looks like this is fairly stable, so long as the
>> iterator we add as a side channel implements seek() properly so as to only
>> return entries whose rows are within a tablet. I imagine it works like so:
>>
>> Suppose we set a custom iterator InjectIterator, at priority 5 as a
>> one-time major compaction iterator, that registers a side channel inside
>> init(). InjectIterator forwards other operations to its parent, as in
>> WrappingIterator
>> <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/iterators/WrappingIterator.html>.
>> We start the compaction:
>>
>> Tablet 1 (a,g]
>>
>> 1. init() called on InjectIterator. Creates the side channel
>> iterator, calls init() on it, and registers it.
>> 2. init() called on VersioningIterator.
>> 3. init() called on top level iterators, including RFiles, in-memory
>> cache and the new side channel.
>> 4. seek( (a,g] ) called on InjectIterator.
>> 5. seek( (a,g] ) called on VersioningIterator.
>> 6. seek( (a,g] ) called on top level iterators
>> 7. next() called on InjectIterator. Forwards to parent.
>> 8. next() called on VersioningIterator. Forwards to parent.
>> 9. next() called on top level iterator (a MultiIterator
>> <https://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/iterators/system/MultiIterator.html>).
>> The next value is read from all the top-level iterator sources and the one
>> with the least key is cached ready to go.
>> 10. ...
>>
>> Tablet 2 (g,p] --- same as tablet 1 except steps 4-6 call seek( (g,p]
>> ). Done in parallel with tablet 1 if on a different tablet server.
>>
>> Is this an accurate depiction? Anything I should treat with caution? It
>> seems to work on my single-node instance, so tips about difficulties in
>> going to multi-node would be welcome.
>>
>> Code available here.
>> <https://github.com/Accla/d4m_api_java/blob/0d8c62164d5c0b59f949ce23c1b85536809764d2/src/main/java/edu/mit/ll/graphulo/InjectIterator.java#L166>
>>
>> Regards,
>> Dylan Hutchison
>>
>> --
>> www.cs.stevens.edu/~dhutchis
>>
>
--
www.cs.stevens.edu/~dhutchis