[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu

2017-09-13 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164373#comment-16164373
 ] 

Eyal Allweil commented on DATAFU-61:


One last thing - I noticed after I uploaded my patch that it has my email, but 
I think it would be better for it to have your email, [~russell.jurney], since 
all I did was write the test. Is it OK if I replace my email with yours 
before committing this, so that we get a (more accurate) "eyal committed with 
russell" type of commit?

> Add TF-IDF Macro to DataFu
> --
>
> Key: DATAFU-61
> URL: https://issues.apache.org/jira/browse/DATAFU-61
> Project: DataFu
> Issue Type: New Feature
> Affects Versions: 1.3.0
> Reporter: Russell Jurney
> Labels: macro
> Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch
>
>
> The first macro I would like to add is a Term Frequency, Inverse Document 
> Frequency implementation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu

2017-09-13 Thread Russell Jurney (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165128#comment-16165128
 ] 

Russell Jurney commented on DATAFU-61:
--

That is fine with me :)



Re: Shepherd Report

2017-09-13 Thread Matthew Hayes

Hey Eyal, sorry for the slow response.  Since the community has already
voted on and passed the graduation proposal, I believe the next step is to
start the discussion in Incubator general.  I've been drafting that email
but haven't sent it out yet.  If that discussion is positive, we would open
it for a vote in Incubator general, and so on, as Dave mentioned.

-Matt

On Mon, Sep 11, 2017 at 8:05 AM, Dave Fisher wrote:

> Hi -
>
> The project, with your mentors' help, has to discuss, create, and vote on a
> graduation resolution. Then you need to submit it to the Incubator for a vote.
> Once the Incubator approves, the Board will need to approve it at its next
> monthly meeting.
>
> Regards,
> Dave
>
> Sent from my iPhone
>
> > On Sep 11, 2017, at 1:44 AM, Eyal Allweil wrote:
> >
> > Hi Dave, everyone -
> >
> > Is there any update with this? Are we graduating?
> >
> > Regards,
> > Eyal
> >
> >
> > On Thursday, August 3, 2017 12:24 AM, Matthew Hayes <matthew.terence.ha...@gmail.com> wrote:
> >
> >
> > Dave, the website has been updated based on your feedback:
> >
> > http://datafu.incubator.apache.org/community/issues.html
> >
> > I updated QU30 in the maturity evaluation with the new info:
> > https://cwiki.apache.org/confluence/display/DATAFU/Maturity+Evaluation
> >
> > -Matt
> >
> > On Wed, Aug 2, 2017 at 10:41 AM, Matthew Hayes <matthew.terence.ha...@gmail.com> wrote:
> >
> > > Thanks for pointing this out, Dave. I'll write a document today on how
> > > to report issues for DataFu and include a section specifically about
> > > how to report security issues.
> > >
> > > -Matt
> > >
> > > > On Aug 2, 2017, at 9:52 AM, David Fisher wrote:
> > > >
> > > > Hi DataFu:
> > > >
> > > > It looks like you are about ready to graduate. Congratulations.
> > > >
> > > > I do have one issue to mention, something that you need to change
> > > > quickly: security issues must not be reported to JIRA. JIRA is not
> > > > secure. Security issues must be reported privately. You can either
> > > > refer to the Foundation-wide secur...@apache.org or use your
> > > > private@datafu list.
> > > >
> > > > Regards,
> > > > Dave
> > >
> >
> >
>


[jira] [Commented] (DATAFU-119) New UDF - TupleDiff

2017-09-13 Thread Matthew Hayes (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165539#comment-16165539
 ] 

Matthew Hayes commented on DATAFU-119:
--

Yeah, the macro files should also have the license header.  We should make sure 
the RAT task verifies these files, if it doesn't already.

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
>  Issue Type: New Feature
> Reporter: Eyal Allweil
> Assignee: Eyal Allweil
>
> A UDF that, given two tuples, prints out the differences between them in 
> human-readable form. This is not meant for production - we use it at PayPal 
> for regression tests, to compare the results of two runs. Differences are 
> calculated based on position, but the tuples' schemas are used, if available, 
> to display friendlier results. If no schema is available, the output 
> uses field numbers.
> It should be used when you want a more fine-grained description of what has 
> changed than 
> [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff] 
> provides. Also, because DIFF takes as its input two bags to be compared, they 
> must fit in memory. This UDF only takes one pair of tuples at a time, so it 
> can run on large inputs.
> We use a macro much like the following in conjunction with this UDF:
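> {noformat}
> -- Compares two relations by primary key and returns only the rows that differ.
> DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, diff_macro_ignored_field) returns diffs {
>   DEFINE TupleDiff datafu.pig.util.TupleDiff;
>
>   -- Keep the key next to a copy of the whole row, packed into one tuple.
>   old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
>   new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
>
>   -- Full outer join, so rows present on only one side are kept.
>   join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
>
>   -- Describe the difference for each pair, skipping the ignored field.
>   join_data = FOREACH join_data GENERATE TupleDiff(old::original, new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, new::original;
>
>   -- Keep only the pairs for which TupleDiff returned a non-null description.
>   $diffs = FILTER join_data BY tupleDiff IS NOT NULL ;
> };
> {noformat}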
> Currently, the output from the macro looks like this (when comma-separated):
> {noformat}
> added,,
> missing,,
> changed field2 field4,,
> {noformat}
> The UDF takes a variable number of parameters - the two tuples to be 
> compared, and any number of field names or numbers to be ignored. We use this 
> to ignore fields representing execution or creation time (the macro I've 
> given as an example assumes only one ignored field).
> The current implementation "drills down" into tuples, but not bags or maps - 
> tuple boundaries are indicated with parentheses, like this:
> {noformat}
> changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) innerEmbeddedTuple(anotherFieldThatIsDifferent))
> {noformat}
> I have a few final things left to do and then I'll put it up on reviewboard.
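
To make the behaviour described in the issue concrete, here is a minimal Python sketch of a position-based tuple diff and of the key-based pairing that the macro performs. It is not the DataFu implementation (which is a Java UDF). The names `tuple_diff` and `diff_by_key`, the schema representation, the zero-based field numbers, the mapping of a missing side to "added" or "missing", and applying the ignore list at every nesting level are all assumptions made for illustration.

```python
def _diff_fields(old, new, schema, ignored):
    """Return display labels for the positions where old and new differ."""
    labels = []
    for pos in range(max(len(old), len(new))):
        entry = schema[pos] if schema is not None and pos < len(schema) else None
        if isinstance(entry, tuple):
            # (name, nested_schema) describes a field that is itself a tuple.
            name, nested = entry
        elif entry is not None:
            name, nested = entry, None
        else:
            # No schema available: fall back to the field number (assumed 0-based).
            name, nested = str(pos), None
        if name in ignored or str(pos) in ignored:
            continue
        a = old[pos] if pos < len(old) else None
        b = new[pos] if pos < len(new) else None
        if isinstance(a, tuple) and isinstance(b, tuple):
            # Drill down into nested tuples; bags and maps are compared whole.
            inner = _diff_fields(a, b, nested, ignored)
            if inner:
                labels.append(f"{name}({' '.join(inner)})")
        elif a != b:
            labels.append(name)
    return labels


def tuple_diff(old, new, schema=None, ignored=()):
    """Return None if the tuples match, else 'added', 'missing' or 'changed ...'."""
    if old is None and new is None:
        return None
    if old is None:
        return "added"
    if new is None:
        return "missing"
    labels = _diff_fields(old, new, schema, set(ignored))
    return "changed " + " ".join(labels) if labels else None


def diff_by_key(old_rows, new_rows, key_pos, schema=None, ignored=()):
    """Pair rows by key (like a full outer join) and keep only the differences."""
    old_by_key = {row[key_pos]: row for row in old_rows}
    new_by_key = {row[key_pos]: row for row in new_rows}
    results = {}
    for key in sorted(old_by_key.keys() | new_by_key.keys()):
        description = tuple_diff(old_by_key.get(key), new_by_key.get(key), schema, ignored)
        if description is not None:
            results[key] = description
    return results


schema = ["id", "field2", "field3", "field4", "run_time"]
old_rows = [(1, "a", "b", "c", 100), (2, "a", "b", "c", 100), (3, "x", "y", "z", 100)]
new_rows = [(1, "a", "b", "c", 200), (2, "A", "b", "C", 200), (4, "n", "n", "n", 200)]
print(diff_by_key(old_rows, new_rows, 0, schema, ignored=["run_time"]))
# prints {2: 'changed field2 field4', 3: 'missing', 4: 'added'}
```

Row 1 differs only in the ignored `run_time` field, so it is dropped, just as the macro's final FILTER drops pairs for which the diff is null. Because each call compares a single pair of tuples, nothing beyond that pair needs to be held in memory for the comparison itself.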





Re: Review Request 49248: New UDF - TupleDiff

2017-09-13 Thread Matthew Hayes

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/49248/#review185355
---


Ship it!

- Matthew Hayes


On Sept. 11, 2017, 1:06 p.m., Eyal Allweil wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/49248/
> ---
> 
> (Updated Sept. 11, 2017, 1:06 p.m.)
> 
> 
> Review request for DataFu and Ido Hadanny.
> 
> 
> Repository: datafu
> 
> 
> Description
> ---
> 
> New UDF - TupleDiff
> 
> 
> Diffs
> -
> 
>   datafu-pig/src/main/java/datafu/pig/util/TupleDiff.java PRE-CREATION 
>   datafu-pig/src/main/resources/datafu/diff_macros.pig PRE-CREATION 
>   datafu-pig/src/test/java/datafu/test/pig/util/TupleDiffTest.java PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/49248/diff/2/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Eyal Allweil
> 
>



[jira] [Commented] (DATAFU-119) New UDF - TupleDiff

2017-09-13 Thread Matthew Hayes (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165592#comment-16165592
 ] 

Matthew Hayes commented on DATAFU-119:
--

I checked the review board.  Code looks good to me.  Can you attach the patch 
here?  Then I will merge it in.



[jira] [Commented] (DATAFU-119) New UDF - TupleDiff

2017-09-13 Thread Matthew Hayes (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165598#comment-16165598
 ] 

Matthew Hayes commented on DATAFU-119:
--

Also, it would be useful to have a page in the guide that demonstrates the usage 
of TupleDiff and the macro.  We can do this separately.



[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu

2017-09-13 Thread Matthew Hayes (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165605#comment-16165605
 ] 

Matthew Hayes commented on DATAFU-61:
-

+1, the updated patch looks good to me.  [~eyal], do you want to merge this in?
