I'd say that was an excellent set of requirements (very similar to the set we arrived at in the last discuss thread on this).
My vote remains a transaction log in HBase. Given the relatively low (human-scale) volume, I would not expect this to need anything fancy like compaction into HDFS state. That does make a good argument, though, for a long-term DataFrame solution for Spark, with a short-term stopgap using a joined DataFrame and SHC.

Simon

> On 22 Jun 2017, at 05:11, Otto Fowler <[email protected]> wrote:
>
> Can you clarify what data stores are at play here?
>
>
> On June 21, 2017 at 17:07:42, Casey Stella ([email protected]) wrote:
>
> Hi All,
>
> I know we've had a couple of these already, but we're due for another
> discussion of a sensible approach to mutating indexed data. The motivation
> for this is users will want to update fields to correct and augment data.
> These corrections are invaluable for things like feedback for ML models or
> just plain providing better context when evaluating alerts, etc.
>
> Rather than posing a solution, I'd like to pose the characteristics of a
> solution and we can fight about those first. ;)
>
> In my mind, the following are the characteristics that I'd look for:
>
> - Changes should be considered additional or replacement fields for
>   existing fields
> - Changes need to be available in the web view in near real time (on the
>   order of milliseconds)
> - Changes should be available in the batch view
>   - I'd be ok with eventually consistent with the web view, thoughts?
> - Changes should have lineage preserved
>   - Current value is the optimized path
>   - Lineage search is the less optimized path
> - If HBase is part of a solution
>   - maintain a scan-free solution
>   - maintain a coprocessor-free solution
>
> Most of what I've thought of is something along the lines:
>
> - Diffs are stored in columns in HBase row(s)
>   - row: GUID:current would have one column with the current
>     representation
>   - row: GUID:lineage would have an ordered set of columns representing
>     the lineage diffs
> - Mutable indices are directly updated (e.g. Solr or ES)
> - We'd probably want to provide transparent read support downstream
>   which supports merging for batch read:
>   - a Spark DataFrame
>   - a Hive SerDe
>
> What I'd like to get out of this discussion is an architecture document
> with a suggested approach and the necessary JIRAs to split this up. If
> anyone has suggestions or comments about any of this, please speak up. I'd
> like to actually get this done in the near-term. :)
>
> Best,
>
> Casey
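
For discussion, the row layout Casey outlines above could be sketched roughly as follows. This is only an in-memory model in Python, not HBase client code: the dict standing in for a table, the "doc" column name, the timestamp-based qualifiers, and all function names are assumptions made up for illustration. It shows the two rows per GUID (current and lineage), reads that are point gets by row key rather than scans, and a batch-side merge that overlays the current changes on the originally indexed record. The direct update of the mutable index (Solr or ES) is left out.

```python
import time
from typing import Any, Dict, List, Optional

# In-memory stand-in for an HBase table: row key -> {column qualifier -> value}.
# Hypothetical illustration only; this is not the HBase client API.
Table = Dict[str, Dict[str, Any]]


def apply_update(table: Table, guid: str, changes: Dict[str, Any],
                 ts: Optional[int] = None) -> None:
    """Record one change set: append the diff to lineage, refresh current."""
    ts = time.time_ns() if ts is None else ts

    # GUID:lineage holds one column per diff. Zero-padded timestamp
    # qualifiers sort lexicographically in time order. (Assumed scheme; two
    # updates with the same timestamp would collide.)
    lineage = table.setdefault(f"{guid}:lineage", {})
    lineage[f"{ts:020d}"] = dict(changes)

    # GUID:current holds a single column with the merged current changes.
    current = table.setdefault(f"{guid}:current", {"doc": {}})
    current["doc"] = {**current["doc"], **changes}


def get_current(table: Table, guid: str) -> Dict[str, Any]:
    """Optimized path: a single lookup by row key, no scan."""
    return dict(table.get(f"{guid}:current", {}).get("doc", {}))


def get_lineage(table: Table, guid: str) -> List[Dict[str, Any]]:
    """Less optimized path: still one row lookup, then ordered columns."""
    row = table.get(f"{guid}:lineage", {})
    return [row[qualifier] for qualifier in sorted(row)]


def merged_view(table: Table, guid: str,
                indexed_doc: Dict[str, Any]) -> Dict[str, Any]:
    """Batch-read merge: changes add to or replace the indexed fields."""
    return {**indexed_doc, **get_current(table, guid)}
```

In a real batch read, merged_view would correspond to joining the indexed data against the GUID:current rows, which is roughly what a joined DataFrame over SHC would do as the stopgap.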
