[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu
[ https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164373#comment-16164373 ]

Eyal Allweil commented on DATAFU-61:

One last thing - I noticed after I uploaded my patch that it has my email, but I think it would be better for it to have your email, [~russell.jurney], since all I did was write the test. Is it OK that I replace my email with yours before committing this, so we get a (more accurate) "eyal committed with russell" type commit?

> Add TF-IDF Macro to DataFu
> --
>
> Key: DATAFU-61
> URL: https://issues.apache.org/jira/browse/DATAFU-61
> Project: DataFu
> Issue Type: New Feature
> Affects Versions: 1.3.0
> Reporter: Russell Jurney
> Labels: macro
> Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch
>
> The first macro I would like to add is a Term Frequency, Inverse Document Frequency implementation.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu
[ https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165128#comment-16165128 ]

Russell Jurney commented on DATAFU-61:

That is fine with me :)
Re: Shepherd Report
Hey Eyal, sorry for the slow response. Since the community has already voted and passed the graduation proposal, the next step I believe is to start the discussion in Incubator general. I've been drafting that email but haven't sent it out yet. If that discussion is positive we would open it for a vote in Incubator general, etc. and so on as Dave mentioned.

-Matt

On Mon, Sep 11, 2017 at 8:05 AM, Dave Fisher wrote:

> Hi -
>
> The project with your mentor's help have to discuss, create and vote on a graduation resolution. Then you need to submit to the Incubator for a vote. Once the Incubator approves the Board will need to approve at the next monthly meeting.
>
> Regards,
> Dave
>
> Sent from my iPhone
>
> > On Sep 11, 2017, at 1:44 AM, Eyal Allweil wrote:
> >
> > Hi Dave, everyone -
> >
> > Is there any update with this? Are we graduating?
> >
> > Regards,
> > Eyal
> >
> > On Thursday, August 3, 2017 12:24 AM, Matthew Hayes <matthew.terence.ha...@gmail.com> wrote:
> >
> > Dave the website has been updated based on your feedback:
> >
> > http://datafu.incubator.apache.org/community/issues.html
> >
> > I updated QU30 in the maturity evaluation with the new info:
> > https://cwiki.apache.org/confluence/display/DATAFU/Maturity+Evaluation
> >
> > -Matt
> >
> > On Wed, Aug 2, 2017 at 10:41 AM, Matthew Hayes <matthew.terence.ha...@gmail.com> wrote:
> >
> > > Thanks for pointing this out Dave. I'll write a document today on how to report issues for DataFu and include a section specifically about how to report security issues.
> > >
> > > -Matt
> > >
> > > > On Aug 2, 2017, at 9:52 AM, David Fisher wrote:
> > > >
> > > > Hi DataFu:
> > > >
> > > > It looks like you are about ready to graduate. Congratulations.
> > > >
> > > > I do have one issue to mention. Something that you need to change quickly. Reporting of a security issue must not be to JIRA. JIRA is not secure. Security issues must be reported privately. Either you can refer to the Foundation wide secur...@apache.org or you can use your private@datafu list.
> > > >
> > > > Regards,
> > > > Dave
[jira] [Commented] (DATAFU-119) New UDF - TupleDiff
[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165539#comment-16165539 ]

Matthew Hayes commented on DATAFU-119:

Yea the macro files should also have the license header. We should make sure the rat task verifies these files if it isn't already.

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
> Issue Type: New Feature
> Reporter: Eyal Allweil
> Assignee: Eyal Allweil
>
> A UDF that given two tuples, prints out the differences between them in human-readable form. This is not meant for production - we use it in PayPal for regression tests, to compare the results of two runs. Differences are calculated based on position, but the tuples' schemas are used, if available, for displaying more friendly results. If no schema is available the output uses field numbers.
>
> It should be used when you want a more fine-grained description of what has changed, unlike [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff]. Also, because DIFF takes as its input two bags to be compared, they must fit in memory. This UDF only takes one pair of tuples at a time, so it can run on large inputs.
>
> We use a macro much like the following in conjunction with this UDF:
> {noformat}
> DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, diff_macro_ignored_field) returns diffs {
>   DEFINE TupleDiff datafu.pig.util.TupleDiff;
>
>   old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
>   new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
>
>   join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
>
>   join_data = FOREACH join_data GENERATE TupleDiff(old::original, new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, new::original;
>
>   $diffs = FILTER join_data BY tupleDiff IS NOT NULL ;
> };
> {noformat}
> Currently, the output from the macro looks like this (when comma-separated):
> {noformat}
> added,,
> missing,,
> changed field2 field4,,
> {noformat}
> The UDF takes a variable number of parameters - the two tuples to be compared, and any number of field names or numbers to be ignored. We use this to ignore fields representing execution or creation time (the macro I've given as an example assumes only one ignored field)
>
> The current implementation "drills down" into tuples, but not bags or maps - tuple boundaries are indicated with parentheses, like this:
> {noformat}
> changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) innerEmbeddedTuple(anotherFieldThatIsDifferent))
> {noformat}
> I have a few final things left to do and then I'll put it up on reviewboard.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
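The position-based comparison described in the DATAFU-119 issue text can be illustrated with a small, standalone Java sketch. This is not the DataFu `TupleDiff` implementation and does not use the Pig API: the class name `PositionalDiffSketch`, the method `diff`, and the use of plain `List` values in place of Pig tuples are all hypothetical, chosen only for illustration. It assumes flat records (no drilling down into nested tuples) and, when no field names are supplied, labels fields by a zero-based position, which may differ from how the real UDF numbers them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

/** Illustrative sketch only; not the DataFu TupleDiff implementation. */
public class PositionalDiffSketch {

    /**
     * Compares two flat records position by position.
     * Returns null when nothing differs, which is what the macro's
     * "tupleDiff IS NOT NULL" filter relies on.
     */
    public static String diff(List<?> oldRec, List<?> newRec,
                              List<String> fieldNames, Set<String> ignored) {
        if (oldRec == null && newRec == null) return null;
        if (oldRec == null) return "added";    // key exists only in the new run
        if (newRec == null) return "missing";  // key exists only in the old run

        List<String> changed = new ArrayList<>();
        int n = Math.max(oldRec.size(), newRec.size());
        for (int i = 0; i < n; i++) {
            // Use the schema name when available, otherwise the field number
            // (zero-based here, which is an assumption of this sketch).
            String label = (fieldNames != null && i < fieldNames.size())
                    ? fieldNames.get(i) : String.valueOf(i);
            if (ignored.contains(label)) continue;
            Object a = i < oldRec.size() ? oldRec.get(i) : null;
            Object b = i < newRec.size() ? newRec.get(i) : null;
            if (!Objects.equals(a, b)) changed.add(label);
        }
        return changed.isEmpty() ? null : "changed " + String.join(" ", changed);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("id", "field2", "field3", "field4", "runTime");
        List<Object> oldRec = Arrays.<Object>asList(1, "a", "b", "c", 100L);
        List<Object> newRec = Arrays.<Object>asList(1, "x", "b", "y", 200L);
        Set<String> ignored = new HashSet<>(Arrays.asList("runTime"));

        System.out.println(diff(oldRec, newRec, names, ignored));          // changed field2 field4
        System.out.println(diff(null, newRec, names, Collections.<String>emptySet())); // added
    }
}
```

Because each call looks at a single pair of records, nothing beyond that pair has to be held in memory, which is the property the issue text contrasts with the built-in DIFF operating on whole bags.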
Re: Review Request 49248: New UDF - TupleDiff
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/49248/#review185355
---

Ship it!

Ship It!

- Matthew Hayes

On Sept. 11, 2017, 1:06 p.m., Eyal Allweil wrote:

> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/49248/
> ---
>
> (Updated Sept. 11, 2017, 1:06 p.m.)
>
> Review request for DataFu and Ido Hadanny.
>
> Repository: datafu
>
> Description
> ---
>
> New UDF - TupleDiff
>
> Diffs
> -
>
> datafu-pig/src/main/java/datafu/pig/util/TupleDiff.java PRE-CREATION
> datafu-pig/src/main/resources/datafu/diff_macros.pig PRE-CREATION
> datafu-pig/src/test/java/datafu/test/pig/util/TupleDiffTest.java PRE-CREATION
>
> Diff: https://reviews.apache.org/r/49248/diff/2/
>
> Testing
> ---
>
> Thanks,
>
> Eyal Allweil
[jira] [Commented] (DATAFU-119) New UDF - TupleDiff
[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165592#comment-16165592 ]

Matthew Hayes commented on DATAFU-119:

I checked the review board. Code looks good to me. Can you attach the patch here? Then I will merge it in.
[jira] [Commented] (DATAFU-119) New UDF - TupleDiff
[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165598#comment-16165598 ]

Matthew Hayes commented on DATAFU-119:

Also it would be useful to have a page in the guide that demonstrated the usage of TupleDiff and the macro. We can do this separately.
[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu
[ https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165605#comment-16165605 ]

Matthew Hayes commented on DATAFU-61:

+1 updated patch looks good to me. [~eyal] want to merge this in?