I wrote a blog post for the PayPal engineering blog detailing some of the
(Pig) content I've contributed to DataFu on behalf of PayPal. The post
contains documentation and code samples of three macros and a UDF:

*dedup* - a macro for deduplicating rows based on a key and a date-updated field

*sample_by_keys* - a macro for generating a sample of a table based on a
list of unique ids

*diff_macro* - a macro for generating a human-readable diff between two tables

*CountDistinctUpTo* - a UDF that performs much better than pure Pig in
cases where you don't need the actual records, only to verify that at least
a certain number of distinct ones exist
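
To give a feel for the idea behind CountDistinctUpTo, here is a minimal
Python sketch. It is not the DataFu implementation; the function name and
parameters are made up for illustration. The point is that counting stops as
soon as the limit is reached, so the rest of the input is never scanned.

```python
from typing import Hashable, Iterable


def count_distinct_up_to(values: Iterable[Hashable], limit: int) -> int:
    """Count distinct values, stopping once `limit` distinct values are seen.

    Hypothetical helper for illustration only; it mirrors the idea of the
    UDF, not its actual code.
    """
    if limit <= 0:
        return 0
    seen = set()
    for value in values:
        seen.add(value)
        if len(seen) >= limit:
            break  # early exit: the remaining input is never read
    return len(seen)
```

A result equal to the limit means "at least that many distinct values
exist"; a smaller result is the exact distinct count.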

https://medium.com/paypal-engineering/a-guide-to-paypals-contributions-to-apache-datafu-b30cc25e0312

The blog post will be cross-posted to the Apache DataFu blog soon.

Cheers,
Eyal
