Hi All,

I want to validate the de-duplication use cases that will be covered as
part of this implementation.

   - *Bounded data set*
      - This is de-duplication of bounded data, for example data sets that
      are old or fixed, or that may not have a time field at all, such as
      last year's transaction records or customer data.
      - The concept of expiry is not needed because the data set is bounded.
   - *Unbounded data set*
      - This is de-duplication of online streaming data.
      - Expiry is needed because incoming tuples may arrive later than
      expected. Expiry is always computed as the difference between the
      system time and the event time.
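
As a rough illustration of the two cases above, here is a minimal Java sketch. It is not the Malhar operator and does not use the managed state APIs. The names `DedupSketch`, `Decision`, and `expiryMillis` are hypothetical, and treating a tuple as expired once (system time - event time) exceeds a configured period is an assumption about how that difference would be applied. Passing a non-positive `expiryMillis` models the bounded case, where expiry is not needed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only; not the Malhar de-duplication operator.
class DedupSketch {
    enum Decision { UNIQUE, DUPLICATE, EXPIRED }

    // Keys seen so far, mapped to the event time of their first occurrence.
    // A real operator would also evict old keys; this sketch never does.
    private final Map<String, Long> seen = new HashMap<>();

    // Assumed expiry period in milliseconds; <= 0 disables expiry (bounded data).
    private final long expiryMillis;

    DedupSketch(long expiryMillis) {
        this.expiryMillis = expiryMillis;
    }

    Decision process(String key, long eventTimeMillis, long systemTimeMillis) {
        // Unbounded case: expiry is the difference between system time and event time.
        if (expiryMillis > 0 && systemTimeMillis - eventTimeMillis > expiryMillis) {
            return Decision.EXPIRED;
        }
        // putIfAbsent returns the previous value, so non-null means the key was seen.
        if (seen.putIfAbsent(key, eventTimeMillis) != null) {
            return Decision.DUPLICATE;
        }
        return Decision.UNIQUE;
    }
}
```

With `new DedupSketch(1000)`, a tuple whose event time is more than one second behind the system time would be classified as expired, while `new DedupSketch(0)` only ever distinguishes unique tuples from duplicates.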

Any feedback is appreciated.

Thanks.

~ Bhupesh

On Mon, Jun 27, 2016 at 11:34 AM, Bhupesh Chawda <bhup...@datatorrent.com>
wrote:

> Hi All,
>
> I am working on adding a De-duplication operator in Malhar library based
> on managed state APIs. I will be working off the already created JIRA -
> https://issues.apache.org/jira/browse/APEXMALHAR-1701 and the initial
> pull request for an AbstractDeduper here:
> https://github.com/apache/apex-malhar/pull/260/files
>
> I am planning to include the following features in the first version:
> 1. Time based de-duplication. Assumption: Tuple_Key -> Tuple_Time
> correlation holds.
> 2. Option to maintain order of incoming tuples.
> 3. Duplicate and Expired ports to emit duplicate and expired tuples
> respectively.
>
> Thanks.
>
> ~ Bhupesh
>
