Late chime in: Dylan and Russ are on the money. A combiner is the way to go.

And since there was some confusion on the matter: a table with 100 '1' values for a given key would require the tablet server to sum those values at scan time and then return the single summed value to the client. After the table compacts (assuming a full major compaction), those 100 '1' values are rewritten on disk as one '100' value. The beauty of this is that, as an application, you don't have to know whether the values were combined by a tablet server before you saw them or whether the value came directly from disk. You just get a strongly consistent view of the data in your table.
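In case a concrete example helps, here is a rough sketch of wiring up a SummingCombiner with the Java API (1.x style). The table name "stats" and column family "counts" are placeholders, not anything from this thread:

    import java.util.Collections;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.iterators.LongCombiner;
    import org.apache.accumulo.core.iterators.user.SummingCombiner;

    public class AttachSummingCombiner {
      static void attach(Connector connector) throws Exception {
        // Priority 10, iterator name "sum". attachIterator() with no scope
        // argument applies the iterator at scan, minc, and majc.
        IteratorSetting setting = new IteratorSetting(10, "sum", SummingCombiner.class);
        // Encode values as printable strings, e.g. "1" and "100".
        LongCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
        // Only combine entries in the (hypothetical) "counts" column family.
        SummingCombiner.setColumns(setting,
            Collections.singletonList(new IteratorSetting.Column("counts")));
        connector.tableOperations().attachIterator("stats", setting);
      }
    }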

Assuming you do go the combiner route, beware of writing a single '1' update for every "term" you see. If you can batch updates to your stats table before writing to Accumulo (splitting the combining work between your client and the servers), you should see better throughput than sending lots of tiny updates to the stats table.
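To make the batching point concrete, here is a sketch (same placeholder table and column names as above) that pre-sums counts in the client, so a thousand sightings of a term become one '1000' mutation instead of a thousand '1' mutations; the server-side combiner then finishes the job across batches:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;

    public class BatchedStatsWriter {
      static void writeCounts(Connector connector, Iterable<String> terms) throws Exception {
        // Combine locally first; far fewer mutations cross the wire.
        Map<String, Long> localCounts = new HashMap<>();
        for (String term : terms) {
          localCounts.merge(term, 1L, Long::sum);
        }
        BatchWriter writer = connector.createBatchWriter("stats", new BatchWriterConfig());
        try {
          for (Map.Entry<String, Long> e : localCounts.entrySet()) {
            Mutation m = new Mutation(e.getKey());
            m.put("counts", "", new Value(Long.toString(e.getValue()).getBytes()));
            writer.addMutation(m);
          }
        } finally {
          writer.close();
        }
      }
    }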

Dylan Hutchison wrote:
Sounds like you have the idea now, Z.  There are three places an iterator
can be applied: scan time, minor compaction time, and major compaction
time.  Minor compactions help your case a lot -- when enough entries are
written to a tablet server that it needs to dump them to a new Hadoop
RFile, the minor compaction iterators run on the entries as they stream
to the RFile.  This means that each RFile has only one entry for each
unique (row, column family, column qualifier) tuple.

Entries with the same (row, column family, column qualifier) in distinct
RFiles will get combined at the next major compaction, or on the fly during
the next scan.
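If it helps to see those scopes spelled out: attachIterator() defaults to all three, but you can also pass them explicitly. A sketch, reusing the placeholder "stats" table from above:

    import java.util.EnumSet;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.iterators.Combiner;
    import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
    import org.apache.accumulo.core.iterators.LongCombiner;
    import org.apache.accumulo.core.iterators.user.SummingCombiner;

    public class AttachWithScopes {
      static void attach(Connector connector) throws Exception {
        IteratorSetting setting = new IteratorSetting(10, "sum", SummingCombiner.class);
        LongCombiner.setEncodingType(setting, LongCombiner.Type.STRING);
        // Combine every column; use setColumns(...) instead to restrict families.
        Combiner.setCombineAllColumns(setting, true);
        // Combine during scans, minor compactions, and major compactions.
        connector.tableOperations().attachIterator("stats", setting,
            EnumSet.of(IteratorScope.scan, IteratorScope.minc, IteratorScope.majc));
      }
    }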

For example, let's say there are 100 rows of [foo, 1]; will they actually be
'combined' into a single row [foo, 100]?


Careful -- Accumulo's combiners combine Keys with identical row, column
family, and column qualifier.  You'd have to write a fancier iterator if
you want to combine all the entries that share the same row.  Let us know
if you need help doing that.
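FWIW, until you have such an iterator, one simple (if less scalable) fallback is to do the per-row sum client-side over the already-combined entries. A sketch against the same placeholder "stats" table:

    import java.util.Map;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    public class ClientSideRowSum {
      static long sumRow(Connector connector, String row) throws Exception {
        // The combiner has already collapsed duplicates within each
        // (row, cf, cq); here we add up the surviving entries in one row.
        Scanner scanner = connector.createScanner("stats", Authorizations.EMPTY);
        try {
          scanner.setRange(Range.exact(row));
          long total = 0;
          for (Map.Entry<Key, Value> entry : scanner) {
            total += Long.parseLong(entry.getValue().toString());
          }
          return total;
        } finally {
          scanner.close();
        }
      }
    }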


On Thu, Aug 27, 2015 at 3:09 PM, z11373 <[email protected]> wrote:

Thanks again Russ!

"but it might not be in this case if most of the data has already been
combined"
Does this mean Accumulo actually combine and persist the combined result
after the scan/compaction (depending on which op the combiner is applied)?
For example, let say there are 100 rows of [foo, 1], it will actually be
'combined' to a single row [foo, 100]? If that is the case, then combiner
is
not expensive.

Wow! That's a brilliant use of the -1 approach; I didn't even think about it
before. Yes, this will work for my case because I only need to know the
count.

Thanks,
Z




