I'd say that was an excellent set of requirements (very similar to the one we
arrived at in the last discussion thread on this).

My vote remains a transaction log in HBase. Given the relatively low volume
(human scale), I would not expect this to need anything fancy like compaction
into HDFS state. That said, it does make a good argument for a long-term
DataFrame solution for Spark, with a short-term stopgap using a joined
DataFrame and shc.

Simon 

> On 22 Jun 2017, at 05:11, Otto Fowler <[email protected]> wrote:
> 
> Can you clarify what data stores are at play here?
> 
> 
> On June 21, 2017 at 17:07:42, Casey Stella ([email protected]) wrote:
> 
> Hi All,
> 
> I know we've had a couple of these already, but we're due for another
> discussion of a sensible approach to mutating indexed data. The motivation
> for this is that users will want to update fields to correct and augment data.
> These corrections are invaluable for things like feedback for ML models or
> just plain providing better context when evaluating alerts, etc.
> 
> Rather than posing a solution, I'd like to pose the characteristics of a
> solution and we can fight about those first. ;)
> 
> In my mind, the following are the characteristics that I'd look for:
> 
> - Changes should be considered additional or replacement fields for
>   existing fields
> - Changes need to be available in the web view in near real time (on the
>   order of milliseconds)
> - Changes should be available in the batch view
>    - I'd be ok with eventual consistency with the web view, thoughts?
> - Changes should have lineage preserved
>    - Current value is the optimized path
>    - Lineage search is the less optimized path
> - If HBase is part of a solution
>    - maintain a scan-free solution
>    - maintain a coprocessor-free solution
> 
> Most of what I've thought of is something along these lines:
> 
> - Diffs are stored in columns in HBase row(s)
>    - row: GUID:current would have one column with the current
>      representation
>    - row: GUID:lineage would have an ordered set of columns representing
>      the lineage diffs
> - Mutable indices are directly updated (e.g. Solr or ES)
> - We'd probably want to provide transparent read support downstream
>   which supports merging for batch read:
>    - a Spark DataFrame
>    - a Hive SerDe
> What I'd like to get out of this discussion is an architecture document
> with a suggested approach and the necessary JIRAs to split this up. If
> anyone has suggestions or comments about any of this, please speak up. I'd
> like to actually get this done in the near-term. :)
> 
> Best,
> 
> Casey
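
To make the row layout proposed above a little more concrete, here is a
minimal sketch in Python of the GUID:current and GUID:lineage rows. It is an
illustration under assumptions, not anything agreed in this thread: a plain
dict stands in for the HBase table, diffs are assumed to be JSON-encoded, and
the names apply_change, merged_view, lineage and the "doc" column qualifier
are all hypothetical.

```python
import json
import time
from typing import Any, Dict, List, Optional

# Assumption: an in-memory stand-in for an HBase table,
# mapping row key -> {column qualifier -> value}.
# A real implementation would issue point Gets and Puts instead.
Table = Dict[str, Dict[str, bytes]]


def apply_change(table: Table, guid: str, diff: Dict[str, Any],
                 ts_ms: Optional[int] = None) -> Dict[str, Any]:
    """Record a diff in GUID:lineage and refresh GUID:current."""
    ts_ms = int(time.time() * 1000) if ts_ms is None else ts_ms
    current_row = table.setdefault(f"{guid}:current", {})
    lineage_row = table.setdefault(f"{guid}:lineage", {})

    # Zero-padded timestamps sort lexicographically in time order, which
    # keeps the lineage columns ordered. Two diffs with the same timestamp
    # would collide here; a real design needs a tie-breaker.
    lineage_row[f"{ts_ms:020d}"] = json.dumps(diff).encode()

    # Fields in the diff are additions to, or replacements for, existing ones.
    current = json.loads(current_row.get("doc", b"{}"))
    current.update(diff)
    current_row["doc"] = json.dumps(current).encode()
    return current


def merged_view(table: Table, guid: str,
                indexed_doc: Dict[str, Any]) -> Dict[str, Any]:
    """Batch read path: overlay the current changes on the original document."""
    row = table.get(f"{guid}:current", {})
    merged = dict(indexed_doc)
    merged.update(json.loads(row.get("doc", b"{}")))
    return merged


def lineage(table: Table, guid: str) -> List[Dict[str, Any]]:
    """Less optimized path: return every diff for a GUID, oldest first."""
    row = table.get(f"{guid}:lineage", {})
    return [json.loads(row[q]) for q in sorted(row)]
```

Both read paths are lookups on a known row key, so nothing in the sketch
needs a scan or a coprocessor. The overlay in merged_view is the same merge
that a Spark DataFrame or Hive SerDe would have to do transparently.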
