On Thu, Jan 3, 2013 at 6:08 PM, Corey Nolet <cno...@texeltek.com> wrote:
> That's funny you bring that up- because I was JUST discussing this as a
> possibility with a coworker. Compaction is really the phase that I'm
> concerned with- as the API for loading the data from the TopN currently only
> allows you to load the last N keys/values for a single index at a time.
>
> Can I guarantee that compaction will pass each row through a single filter?

Yes and no. The same iterator instance is used for an entire compaction and
is only initialized and seeked once. However, a compaction sometimes
processes only a subset of a tablet's files. Therefore you cannot guarantee
that you will see all columns in a row; you may see only a subset. Also, if
you have locality groups enabled, each locality group is compacted
separately.

>
> On Jan 3, 2013, at 5:54 PM, Keith Turner wrote:
>
>> Data is read from the iterators into a buffer. When the buffer fills
>> up, the data is sent to the client and the iterators are reinitialized
>> to fill up the next buffer.
>>
>> The default buffer size was changed from 50M to 1M at some point.
>> This is configured via the property table.scan.max.memory
>>
>> The lower buffer size will cause iterator to be reinitialized more
>> frequently. Maybe you are seeing this.
>>
>> Keith
>>
>> On Thu, Jan 3, 2013 at 5:41 PM, Corey Nolet <cno...@texeltek.com> wrote:
>>> Hey Guys,
>>>
>>> In "Accumulo 1.3.5", I wrote a "Top N" table structure, services and a
>>> FilteringIterator that would allow us to drop in several keys/values
>>> associated with a UUID (similar to a document id). The UUID was further
>>> associated with an "index" (or type). The purpose of the TopN table was to
>>> keep the keys/values separated so that they could still be queried back with
>>> cell-level tagging, but when I performed a query for an index, I would get
>>> the last N UUIDs and further be able to query the keys/values for each of
>>> those UUIDs.
>>>
>>> This problem seemed simple to solve in Accumulo 1.3.5, as I was able to
>>> provide 2 FilteringIterators for compaction time to perform data cleanup of
>>> the table so that any keys/values kept around were guaranteed to be inside
>>> of the range of those keys being managed by the versioning iterator.
>>>
>>> Just to recap, I have the following table structure. I also hash the
>>> keys/values and run a filter before the versioning iterator to clean up any
>>> duplicates.
>>> There are two types of columns: index & key/value.
>>>
>>>
>>> Index:
>>>
>>> R: index (or "type" of data)
>>> F: '\x00index'
>>> Q: empty
>>> V: uuid\x00hashOfKeys&Values
>>>
>>>
>>> Key/Value:
>>>
>>> R: index (or "type" of data)
>>> F: uuid
>>> Q: key\x00value
>>> V: empty
>>>
>>>
>>> The filtering iterator that makes sure any key/value rows are in the index
>>> manages a hashset internally. The index rows are purposefully indexed before
>>> the key/value rows so that the filter can build up the hashset containing
>>> those uuids in the index. As the filter iterates into the key/value rows, it
>>> will return true only if the uuid of the key/value exists inside of the
>>> hashset containing the uuids in the index. This worked with older versions
>>> of accumulo but I'm now getting a weird artifact where INIT() is called on
>>> my Filter in the middle of iterating through an index row.
>>>
>>> More specifically, the Filter will iterate through the index rows of a
>>> specific "index" and build up a hashset, then init() will be called which
>>> wipes away the hashset of uuids, then the filter goes on to iterate through
>>> the key/value rows. Keep in mind, we are talking about maybe 400k entries,
>>> not enough to have more than 1 tablet.
>>>
>>> Any idea why this may have worked on 1.3.5 but doesn't work any longer? I
>>> know it has got to be a huge nono to be storing state inside of a filter,
>>> but I haven't had any issues until trying to update my code for the new
>>> version. If I'm doing this completely wrong, any ideas on how to make this
>>> better?
>>>
>>>
>>> Thanks!
>>>
>>>
>>> --
>>> Corey Nolet
>>> Senior Software Engineer
>>> TexelTek, inc.
>>> [Office] 301.880.7123
>>> [Cell] 410-903-2110
>
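
For anyone reading this thread later, the scan-time behaviour described above (buffer fills, results go to the client, iterators are reinitialized) can be illustrated with a small standalone simulation. This is only a sketch and does not use the Accumulo API: `ReinitSketch`, `Entry`, `StatefulFilter`, `scan`, and `sampleRow` are hypothetical names, and the buffer is modelled as a simple count of returned entries rather than a byte size like table.scan.max.memory.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReinitSketch {
    // Column family used for index entries, as in the table layout above.
    static final String INDEX_FAMILY = "\0index";

    /** Hypothetical stand-in for one entry in a row: column family plus value. */
    static final class Entry {
        final String family;
        final String value;
        Entry(String family, String value) {
            this.family = family;
            this.value = value;
        }
    }

    /** Hypothetical filter that remembers indexed uuids between calls. */
    static final class StatefulFilter {
        private Set<String> indexedUuids = new HashSet<>();

        /** Models re-initialization: all accumulated state is lost. */
        void init() {
            indexedUuids = new HashSet<>();
        }

        boolean accept(Entry e) {
            if (e.family.equals(INDEX_FAMILY)) {
                // Index value is uuid\0hash; remember the uuid part.
                int sep = e.value.indexOf('\0');
                indexedUuids.add(sep < 0 ? e.value : e.value.substring(0, sep));
                return true;
            }
            // Key/value entries carry the uuid in the column family.
            return indexedUuids.contains(e.family);
        }
    }

    /** Simulated scan: the filter is re-initialized each time the buffer fills. */
    static List<Entry> scan(List<Entry> row, int bufferSize) {
        StatefulFilter filter = new StatefulFilter();
        filter.init();
        List<Entry> returned = new ArrayList<>();
        int buffered = 0;
        for (Entry e : row) {
            if (buffered == bufferSize) {
                // Buffer "sent to the client"; the next buffer starts from a fresh init().
                filter.init();
                buffered = 0;
            }
            if (filter.accept(e)) {
                returned.add(e);
                buffered++;
            }
        }
        return returned;
    }

    /** Two index entries, then key/value entries for uuid-a, uuid-b and an unindexed uuid-c. */
    static List<Entry> sampleRow() {
        List<Entry> row = new ArrayList<>();
        row.add(new Entry(INDEX_FAMILY, "uuid-a\0hashA"));
        row.add(new Entry(INDEX_FAMILY, "uuid-b\0hashB"));
        row.add(new Entry("uuid-a", ""));
        row.add(new Entry("uuid-a", ""));
        row.add(new Entry("uuid-b", ""));
        row.add(new Entry("uuid-c", ""));
        return row;
    }

    public static void main(String[] args) {
        System.out.println("large buffer: " + scan(sampleRow(), 100).size() + " entries");
        System.out.println("small buffer: " + scan(sampleRow(), 2).size() + " entries");
    }
}
```

With the large buffer the filter sees the whole row in one pass and keeps the key/value entries whose uuid is indexed. With the small buffer, init() runs right after the index entries are returned, so the uuid set is empty when the key/value entries arrive and they are all dropped. That matches the symptom reported in the original message.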