[ 
https://issues.apache.org/jira/browse/CASSANDRA-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13048267#comment-13048267
 ] 

Stu Hood commented on CASSANDRA-2062:
-------------------------------------

bq. I don't think I understand the problem we're trying to solve here. If 
hasNext is expensive on a given iterator, the straightforward fix is to cache 
the answer (and invalidate it on next()).
hasNext is only expensive if the current item hasn't been consumed: in order to 
answer "hasNext" in a block based format (where the length of the entire row is 
not available) you have to read the remainder of the row, which means scanning 
the data and potentially seeking.

The fix this ticket implements is to guarantee that hasNext is called lazily 
once the consumer of the MergeIterator calls hasNext, rather than eagerly after 
an item (row) is generated, as implemented in CollatingIterator. To avoid 
seeking, we need to guarantee:

# A = next()
# A is completely consumed (see [1])
# hasNext() ?
# B = next()

The Collating+ReducingIterator used on SSTableScanners by CompactionIterator is 
the particular area of concern. The items that are being iterated over are 
rows, which are themselves IColumnIterators. An SSTableScanner in trunk reads 
the row length from the front of the row, so hasNext does not require a seek. 
But CASSANDRA-674 will _not_ store a row length anywhere (it's a feature!), and 
there might be multiple blocks involved in a single IColumnIterator.

[1] in CASSANDRA-2629, "consumption" of an IColumnIterator is added as a 
contract on close()

> Better control of iterator consumption
> --------------------------------------
>
>                 Key: CASSANDRA-2062
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2062
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>            Priority: Minor
>             Fix For: 1.0
>
>         Attachments: 
> 0001-CASSANDRA-2062-0001-Improved-iterator-for-merging-sort.txt, 
> 0002-CASSANDRA-2062-0002-Port-all-collating-consumers-to-Me.txt
>
>
> The core reason for this ticket is to gain control over the consumption of 
> the lazy nested iterators in the read path.
> {quote}We survive now because we write the size of the row at the front of 
> the row (via some serious acrobatics at write time), which gives us hasNext() 
> for rows for free. But it became apparent while working on the block-based 
> format that hasNext() will not be cheap unless the current item has been 
> consumed. "Consumption" of the row is easy, and blocks will be framed so that 
> they can be very easily skipped, but you don't want to have to seek to the 
> end of the row to answer hasNext, and then seek back to the beginning to 
> consume the row, which is what CollatingIterator would have forced us to 
> do.{quote}
> While we're at it, we can also improve efficiency: for {{M}} iterators 
> containing {{N}} total items, commons.collections.CollatingIterator performs 
> a {{O(M*N)}} merge, and calls hasNext multiple times per returned value. We 
> can do better.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to