[jira] [Updated] (DATAFU-119) New UDF - TupleDiff

2018-01-02 Thread Matthew Hayes (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Hayes updated DATAFU-119:
-
Fix Version/s: 1.3.3

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
> Issue Type: New Feature
> Reporter: Eyal Allweil
> Assignee: Eyal Allweil
> Fix For: 1.3.3
>
> Attachments: DATAFU-119-2.patch
>
>
> A UDF that, given two tuples, prints the differences between them in 
> human-readable form. It is not meant for production; we use it at PayPal 
> for regression tests, to compare the results of two runs. Differences are 
> calculated by position, but the tuples' schemas are used, if available, 
> to display friendlier results. If no schema is available, the output 
> uses field numbers.
> Use it when you want a finer-grained description of what has 
> changed than 
> [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff]
> provides. Also, because DIFF takes two bags as its input, both must fit 
> in memory. This UDF takes only one pair of tuples at a time, so it can run on 
> large inputs.
> We use a macro much like the following in conjunction with this UDF:
> {noformat}
> DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, diff_macro_ignored_field) returns diffs {
>   DEFINE TupleDiff datafu.pig.util.TupleDiff;
>
>   -- Keep the key next to the whole row packed into a single tuple.
>   old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
>   new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
>
>   -- Full outer join, so rows present in only one run are kept.
>   join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
>
>   join_data = FOREACH join_data GENERATE TupleDiff(old::original, new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, new::original;
>
>   -- Keep only the rows for which a difference was reported.
>   $diffs = FILTER join_data BY tupleDiff IS NOT NULL;
> };
> {noformat}
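> Calling the macro might look like the following minimal sketch. It assumes 
> the macro is saved in a file named diff_macro.pig and that both runs share a 
> key field named id and a timestamp field named run_time. The jar path, file 
> name, input paths, and field names are all hypothetical and only serve as 
> an illustration:
> {noformat}
> REGISTER datafu-pig.jar;  -- hypothetical path to the DataFu jar
> IMPORT 'diff_macro.pig';  -- hypothetical file holding the macro
>
> -- Hypothetical inputs: the outputs of two runs with the same schema.
> old_run = LOAD 'output/old' AS (id:chararray, field2:int, field4:chararray, run_time:long);
> new_run = LOAD 'output/new' AS (id:chararray, field2:int, field4:chararray, run_time:long);
>
> -- Compare the two runs keyed on id, ignoring run_time.
> diffs = diff_macro(old_run, new_run, id, run_time);
> DUMP diffs;
> {noformat}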
> Currently, the output from the macro looks like this (when comma-separated):
> {noformat}
> added,,
> missing,,
> changed field2 field4,,
> {noformat}
> The UDF takes a variable number of parameters: the two tuples to be 
> compared, followed by any number of field names or numbers to be ignored. We 
> use this to ignore fields representing execution or creation time (the macro 
> I've given as an example assumes only one ignored field).
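> A direct call that ignores more than one field might look like this sketch, 
> where joined is a hypothetical relation that already holds both tuples in 
> each row, and old_tuple, new_tuple, run_time and created_at are hypothetical 
> field names:
> {noformat}
> DEFINE TupleDiff datafu.pig.util.TupleDiff;
>
> -- The two tuples come first, then each field to ignore as its own argument.
> result = FOREACH joined GENERATE TupleDiff(old_tuple, new_tuple, 'run_time', 'created_at') AS tupleDiff;
> {noformat}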
> The current implementation "drills down" into tuples, but not into bags or 
> maps. Tuple boundaries are indicated with parentheses, like this:
> {noformat}
> changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) innerEmbeddedTuple(anotherFieldThatIsDifferent))
> {noformat}
> I have a few final things left to do, and then I'll put it up on Review Board.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (DATAFU-119) New UDF - TupleDiff

2017-09-14 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-119:

Attachment: DATAFU-119-2.patch

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
> Issue Type: New Feature
> Reporter: Eyal Allweil
> Assignee: Eyal Allweil
> Attachments: DATAFU-119-2.patch


