[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471164#comment-15471164 ]
Eyal Allweil commented on DATAFU-119:
-------------------------------------

Any feedback about this?

> New UDF - TupleDiff
> -------------------
>
>                 Key: DATAFU-119
>                 URL: https://issues.apache.org/jira/browse/DATAFU-119
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>
> A UDF that, given two tuples, prints out the differences between them in
> human-readable form. This is not meant for production - we use it at PayPal
> for regression tests, to compare the results of two runs. Differences are
> calculated based on position, but the tuples' schemas are used, if available,
> to display friendlier results. If no schema is available, the output uses
> field numbers.
> It should be used when you want a more fine-grained description of what has
> changed than
> [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff]
> provides. Also, because DIFF takes as its input two bags to be compared, they
> must fit in memory. This UDF only takes one pair of tuples at a time, so it
> can run on large inputs.
> We use a macro much like the following in conjunction with this UDF:
> {noformat}
> DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, diff_macro_ignored_field) returns diffs {
>     DEFINE TupleDiff datafu.pig.util.TupleDiff;
>
>     old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
>     new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
>
>     join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
>
>     join_data = FOREACH join_data GENERATE TupleDiff(old::original, new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, new::original;
>
>     $diffs = FILTER join_data BY tupleDiff IS NOT NULL;
> };
> {noformat}
> Currently, the output from the macro looks like this (when comma-separated):
> {noformat}
> added,<original tuple>,
> missing,,<new tuple>
> changed field2 field4,<original tuple>,<new tuple>
> {noformat}
> The UDF takes a variable number of parameters - the two tuples to be
> compared, and any number of field names or numbers to be ignored. We use this
> to ignore fields representing execution or creation time (the macro I've
> given as an example assumes only one ignored field).
> The current implementation "drills down" into tuples, but not bags or maps -
> tuple boundaries are indicated with parentheses, like this:
> {noformat}
> changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) innerEmbeddedTuple(anotherFieldThatIsDifferent))
> {noformat}
> I have a few final things left to do and then I'll put it up on reviewboard.
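To make the comparison described in the issue easier to picture, here is a minimal Java sketch of a positional diff with optional field names, ignored fields, and drill-down into nested tuples. It is not the DataFu implementation and does not use the Pig API: tuples are modelled as plain `List`s, and the `TupleDiffSketch` class, its `Field` schema node, and the zero-based field numbering are all assumptions made for illustration. It covers only the "changed" case; handling of a null tuple on either side (the "added" and "missing" rows) is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

/** Illustrative stand-in only; NOT the datafu.pig.util.TupleDiff source. */
public class TupleDiffSketch {

    /** Hypothetical schema node: a field name plus child fields for a nested tuple. */
    static final class Field {
        final String name;
        final List<Field> children;

        Field(String name, Field... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    /** Returns null when nothing differs, otherwise "changed" plus the differing fields. */
    static String diff(List<?> oldTuple, List<?> newTuple, List<Field> schema, String... ignored) {
        Set<String> ignoredSet = new HashSet<>(Arrays.asList(ignored));
        String fields = changedFields(oldTuple, newTuple, schema, ignoredSet);
        return fields.isEmpty() ? null : "changed " + fields;
    }

    private static String changedFields(List<?> oldTuple, List<?> newTuple,
                                        List<Field> schema, Set<String> ignored) {
        List<String> changed = new ArrayList<>();
        int size = Math.max(oldTuple.size(), newTuple.size());
        for (int i = 0; i < size; i++) {
            // Use the schema name when there is one, else the position (assumed zero-based).
            Field field = (schema != null && i < schema.size()) ? schema.get(i) : null;
            String label = (field != null) ? field.name : String.valueOf(i);
            // A field can be ignored by name or by number (applied at every nesting level here).
            if (ignored.contains(label) || ignored.contains(String.valueOf(i))) {
                continue;
            }
            Object a = i < oldTuple.size() ? oldTuple.get(i) : null;
            Object b = i < newTuple.size() ? newTuple.get(i) : null;
            if (a instanceof List && b instanceof List) {
                // Drill down into nested tuples and wrap their differences in parentheses.
                List<Field> childSchema =
                        (field != null && !field.children.isEmpty()) ? field.children : null;
                String inner = changedFields((List<?>) a, (List<?>) b, childSchema, ignored);
                if (!inner.isEmpty()) {
                    changed.add(label + "(" + inner + ")");
                }
            } else if (!Objects.equals(a, b)) {
                changed.add(label);
            }
        }
        return String.join(" ", changed);
    }

    public static void main(String[] args) {
        List<Field> schema = Arrays.asList(
                new Field("field1"), new Field("field2"), new Field("createdAt"),
                new Field("outer", new Field("inner")));
        List<Object> oldTuple = Arrays.asList(1, "a", 100L, Arrays.asList("x"));
        List<Object> newTuple = Arrays.asList(1, "b", 200L, Arrays.asList("y"));

        // With a schema, ignoring the timestamp-like field.
        System.out.println(diff(oldTuple, newTuple, schema, "createdAt")); // changed field2 outer(inner)
        // Without a schema, field numbers are used instead of names.
        System.out.println(diff(oldTuple, newTuple, null)); // changed 1 2 3(0)
    }
}
```

Because each call looks at a single pair of tuples, nothing beyond those two tuples has to be held in memory, which is the property the issue contrasts with DIFF.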