Shelving is fine. When it comes time to do this, I thikn that the rationale for the pending stuff is that merging is roughly as expensive as a copy. Thus, we should be able to arrange that there really is a copy and the old arrays. This will allow any iterators to complete cleanly if the pending list isn't a problem. Presumably even the list can be made to work since it only has append operations and the iterator can snapshot the length.
On Wed, Mar 27, 2013 at 10:12 PM, Dan Filimon <dangeorge.fili...@gmail.com>wrote: > I realized that in fact just the pending array is not enough. There is > also the distances array that needs fixing. > In fact, an entire new copy is needed if we're to be able to safely > iterate and reindex. > > I'm shelving this for now. That ok? > > On Wed, Mar 27, 2013 at 4:35 AM, Ted Dunning <ted.dunn...@gmail.com> > wrote: > > Another option is to make the iterator take a reference to the array as > it > > exists and then during merging always create a new array. > > > > A second option is to just let the iterator get a bit confused (don't > like > > the smell there). > > > > On Tue, Mar 26, 2013 at 10:59 PM, Dan Filimon > > <dangeorge.fili...@gmail.com>wrote: > > > >> Ted, everyone, > >> > >> I added the multiple pass BallKMeans at the end of the reducer and > >> have uncovered a series of issues with the searchers. > >> > >> FastProjectionSearch stores "pending" vectors in a separate list from > >> the main projections, the idea being that use arrays throughout rather > >> than a tree multiset to reduce the overhead. > >> > >> Here's what's happening: > >> - after having added a bunch of centroids to the searcher, > >> - we iterate through each centroid and search for its closest neighbor, > >> - when calling search(), reindex() is called internally that deals > >> with pending entries that haven't been merged in some time. When > >> reindex() mutates the data structures, the iterator we're using to go > >> through the centroids becomes invalid and a > >> ConcurrentModificationException is thrown. > >> > >> I'm not really sure what the best approach is to fix this. Here are > >> some options: > >> - expose reindex() and demand users to call it explicitly (which > >> completely kills abstraction); > >> - when instantiating an Iterator, create a helper ImmutableList that > >> keeps track of the pending values for the duration of the iteration > >> and when iterating go through these first and release the reference to > >> the list as soon as possible; > >> - come up with an alternative design and really think about how this > >> searcher is useful. > >> > >> In particular, when running minimal tests (SearchQualityTest), this > >> FPS is faster than ordinary ProjectionSearch but worse than LSH. > >> > >> Any thoughts? > >> >