This is very valuable. I have heard the following feature requests from customers:
- Ability to spool to HDFS (or any DFS interface)
- Ability to pick and choose tuples, i.e. not every tuple may need to be tracked
- Minimal performance hit
- The current API remains as is
- Ability to get the content based on tuple-id

Apex should enable this with minimal or no coding from users.

Thanks,
Amol

On Mon, Apr 25, 2016 at 12:00 AM, Ashwin Chandra Putta <[email protected]> wrote:

> Hi All,
>
> I have heard of a few use cases where lineage support is asked for. On
> Apex, it seems to be an ask for the ability to uniquely track each tuple as
> it flows through the DAG. It further boils down to being able to track
> every tuple going into each operator and the corresponding tuple going out
> of the operator. Here is a quick list I put together to describe some
> requirements for lineage support on Apex. Please feel free to improve or
> add to it. Also, please respond with ideas on how we can solve this on the
> Apex platform.
>
> When lineage is enabled,
> 1. We should be able to track each tuple as it enters and exits an
> operator, e.g. enrichment.
> 2. We should be able to track all the tuples that contributed to a tuple
> that is emitted, e.g. dimensions computation.
> 3. We should be able to track all the tuples that contributed to all the
> tuples emitted by the operator, e.g. join?
>
> --
>
> Regards,
> Ashwin.
>
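
To make the feature list at the top of this mail concrete, here is a rough sketch of how selective tracking and lookup by tuple-id might fit together. Everything in it is hypothetical: `LineageSketch`, `LineageStore`, `Record`, and the `shouldTrack` filter are names made up for illustration and are not part of the Apex API, and the in-memory map only stands in for a spool to HDFS or another DFS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical sketch: none of these types exist in Apex.
public class LineageSketch {

    // One lineage entry: the emitted tuple's id, its content, and the ids of
    // the input tuples that contributed to it.
    static final class Record {
        final long tupleId;
        final Object payload;
        final List<Long> parentIds;

        Record(long tupleId, Object payload, List<Long> parentIds) {
            this.tupleId = tupleId;
            this.payload = payload;
            this.parentIds = new ArrayList<>(parentIds);
        }
    }

    // In-memory stand-in for a spool to HDFS or any other DFS.
    static final class LineageStore {
        private final Map<Long, Record> records = new HashMap<>();
        private final Predicate<Object> shouldTrack;

        LineageStore(Predicate<Object> shouldTrack) {
            this.shouldTrack = shouldTrack;
        }

        // Stores the tuple only if the filter selects it, so untracked
        // tuples cost one predicate call. Returns true if it was stored.
        boolean record(long tupleId, Object payload, List<Long> parentIds) {
            if (!shouldTrack.test(payload)) {
                return false;
            }
            records.put(tupleId, new Record(tupleId, payload, parentIds));
            return true;
        }

        // Fetches the stored content and parents for a tuple-id, if tracked.
        Optional<Record> lookup(long tupleId) {
            return Optional.ofNullable(records.get(tupleId));
        }
    }

    public static void main(String[] args) {
        // Track only tuples whose payload is a string starting with "order".
        LineageStore store = new LineageStore(
            p -> p instanceof String && ((String) p).startsWith("order"));

        store.record(1L, "order:42", Collections.<Long>emptyList());
        store.record(2L, "heartbeat", Collections.<Long>emptyList());
        store.record(3L, "order:42:enriched", Arrays.asList(1L));

        // Tuple 3 was produced from tuple 1 (the enrichment case).
        System.out.println(store.lookup(3L).get().parentIds); // prints [1]
        // Tuple 2 was filtered out, so nothing was stored for it.
        System.out.println(store.lookup(2L).isPresent());     // prints false
    }
}
```

The parent-id list is what would let the dimensions and join cases in the original mail be answered by walking back from an emitted tuple; how the platform would populate it without user code is the open question.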
