I wrote a blog post for the PayPal engineering blog detailing some of the (Pig) content I've contributed to DataFu on behalf of PayPal. The post contains documentation and code samples for three macros and a UDF:

*dedup* - a macro for deduplicating rows based on a key and a date-updated field
*sample_by_keys* - a macro for generating a sample of a table based on a list of unique ids
*diff_macro* - a macro for generating a human-readable diff between two tables
*CountDistinctUpTo* - a UDF which performs much better than pure Pig in cases where you don't need the actual records, but only need to verify that a certain number of them exist

https://medium.com/paypal-engineering/a-guide-to-paypals-contributions-to-apache-datafu-b30cc25e0312

The blog post will be cross-posted to the Apache DataFu blog soon.

Cheers,
Eyal