[jira] [Resolved] (DATAFU-61) Add TF-IDF Macro to DataFu

2017-09-14 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil resolved DATAFU-61.

Resolution: Fixed
  Assignee: Eyal Allweil

Merged.

> Add TF-IDF Macro to DataFu
> --
>
> Key: DATAFU-61
> URL: https://issues.apache.org/jira/browse/DATAFU-61
> Project: DataFu
>  Issue Type: New Feature
>Affects Versions: 1.3.0
>Reporter: Russell Jurney
>Assignee: Eyal Allweil
>  Labels: macro
> Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch
>
>
> The first macro I would like to add is a Term Frequency, Inverse Document 
> Frequency implementation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-128) Add documentation for macros

2017-09-14 Thread Matthew Hayes (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16166594#comment-16166594
 ] 

Matthew Hayes commented on DATAFU-128:
--

Yes, the documentation is accurate. The source for the site is in our datafu 
git repo, but the static content that it builds is stored in svn.

> Add documentation for macros
> 
>
> Key: DATAFU-128
> URL: https://issues.apache.org/jira/browse/DATAFU-128
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Eyal Allweil
>
> Now that it is possible to add Pig macros to Datafu, we should update the 
> documentation to reflect this, and provide guidelines and point would-be 
> contributors to examples.





[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu

2017-09-14 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165991#comment-16165991
 ] 

Eyal Allweil commented on DATAFU-61:


Yes, I'll merge it.

I did respond to an open issue in the review request that I only just noticed, 
something about using COUNT vs. SUM when calculating the IDF part ... as far as 
I can tell, the existing code is OK but it wouldn't hurt if you or Russell want 
to take a look at it.






Re: Review Request 27820: Setup for Macros in DataFu. Basic setup, no automated testing. Need feedback.

2017-09-14 Thread Eyal Allweil via Review Board


> On Nov. 14, 2014, 2:40 a.m., Matthew Hayes wrote:
> > datafu-pig/src/main/macros/nlp/tf_idf.pig
> > Lines 72 (patched)
> > 
> >
> > Shouldn't this be SUM?

As far as I can tell, it's OK that this is COUNT, since we're counting documents 
(as I understand TF-IDF, the IDF part divides by the number of documents that 
contain the term, not by the number of actual occurrences).
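
To make the COUNT vs. SUM point concrete, here is a minimal Python sketch of one common TF-IDF variant. It is not the code from tf_idf.pig (which is not shown in this thread); the function name `tf_idf`, the toy documents, and the choice of natural log are assumptions for illustration only. The document frequency is a count of distinct documents containing a term, so a term repeated many times in one document still contributes 1.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps doc_id -> list of tokens; returns {(doc_id, term): score}.

    Hypothetical helper for illustration, not the DataFu macro.
    """
    n_docs = len(docs)

    # Document frequency: how many distinct documents contain each term.
    # This is a COUNT over (term, document) pairs, not a SUM of occurrences.
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        for term, occurrences in counts.items():
            tf = occurrences / len(tokens)             # term frequency within the document
            idf = math.log(n_docs / doc_freq[term])    # inverse document frequency
            scores[(doc_id, term)] = tf * idf
    return scores


# Toy corpus (made up): "pig" occurs three times, but in only one document.
docs = {
    "d1": ["pig", "pig", "pig", "macro"],
    "d2": ["macro", "udf"],
}
scores = tf_idf(docs)
```

With this corpus, summing occurrences would give "pig" a frequency of 3 against only 2 documents, making the ratio inside the log less than 1 and the IDF negative; counting documents gives 1, which is what the formula expects.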


- Eyal


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27820/#review61348
---


On Nov. 10, 2014, 8:33 p.m., Russell Jurney wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/27820/
> ---
> 
> (Updated Nov. 10, 2014, 8:33 p.m.)
> 
> 
> Review request for DataFu, pig, Joseph Adler, Jakob Homan, Matthew Hayes, and 
> Sam Shah.
> 
> 
> Repository: datafu
> 
> 
> Description
> ---
> 
> DATAFU-61 - Add TF-IDF Macro to DataFu
> 
> 
> Diffs
> -
> 
>   datafu-pig/src/main/macros/nlp/tf_idf.pig PRE-CREATION 
>   datafu-pig/src/test/macros/nlp/test_tf_idf.pig PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/27820/diff/1/
> 
> 
> Testing
> ---
> 
> Works for me, but testing not automated. See 
> https://issues.apache.org/jira/browse/DATAFU-61
> 
> 
> Thanks,
> 
> Russell Jurney
> 
>



[jira] [Commented] (DATAFU-119) New UDF - TupleDiff

2017-09-14 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165925#comment-16165925
 ] 

Eyal Allweil commented on DATAFU-119:
-

The documentation can be part of 
[DATAFU-128|https://issues.apache.org/jira/browse/DATAFU-128].

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
> Attachments: DATAFU-119-2.patch
>
>
> A UDF that, given two tuples, prints out the differences between them in 
> human-readable form. This is not meant for production - we use it at PayPal 
> for regression tests, to compare the results of two runs. Differences are 
> calculated based on position, but the tuples' schemas are used, if available, 
> to display friendlier results. If no schema is available, the output 
> uses field numbers.
> It should be used when you want a more fine-grained description of what has 
> changed than 
> [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff] 
> provides. Also, because DIFF takes as its input two bags to be compared, they 
> must fit in memory. This UDF only takes one pair of tuples at a time, so it 
> can run on large inputs.
> We use a macro much like the following in conjunction with this UDF:
> {noformat}
> DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, diff_macro_ignored_field) returns diffs {
>   DEFINE TupleDiff datafu.pig.util.TupleDiff;
>
>   old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
>   new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
>
>   join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
>
>   join_data = FOREACH join_data GENERATE TupleDiff(old::original, new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, new::original;
>
>   $diffs = FILTER join_data BY tupleDiff IS NOT NULL ;
> };
> {noformat}
> Currently, the output from the macro looks like this (when comma-separated):
> {noformat}
> added,,
> missing,,
> changed field2 field4,,
> {noformat}
> The UDF takes a variable number of parameters - the two tuples to be 
> compared, and any number of field names or numbers to be ignored. We use this 
> to ignore fields representing execution or creation time (the macro I've 
> given as an example assumes only one ignored field).
> The current implementation "drills down" into tuples, but not bags or maps - 
> tuple boundaries are indicated with parentheses, like this:
> {noformat}
> changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) innerEmbeddedTuple(anotherFieldThatIsDifferent))
> {noformat}
> I have a few final things left to do and then I'll put it up on reviewboard.
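
The behaviour described above (positional comparison, optional field names, ignored fields, drilling into nested tuples) can be sketched in Python roughly as follows. This is not the UDF's implementation: the names `tuple_diff` and `_changed_fields`, the schema representation, the rule for which side yields "added" versus "missing", and the exact nested output format are all assumptions made for illustration.

```python
def _changed_fields(old, new, names, ignored):
    """Return labels of positions that differ, recursing into nested tuples."""
    changed = []
    for i in range(max(len(old), len(new))):
        # A schema entry is either a field name or a (name, nested_names) pair
        # for an embedded tuple; with no schema, fall back to the field number.
        entry = names[i] if names and i < len(names) else str(i)
        label, nested = entry if isinstance(entry, tuple) else (entry, None)
        if label in ignored or str(i) in ignored:
            continue
        a = old[i] if i < len(old) else None
        b = new[i] if i < len(new) else None
        if isinstance(a, tuple) and isinstance(b, tuple):
            inner = _changed_fields(a, b, nested, ignored)
            if inner:
                # Tuple boundaries are marked with parentheses.
                changed.append(f"{label}({' '.join(inner)})")
        elif a != b:
            changed.append(label)
    return changed


def tuple_diff(old, new, names=None, ignored=()):
    """Describe differences between two tuples, or return None if they match.

    Assumption: a missing old tuple means "added", a missing new one "missing".
    """
    if old is None and new is None:
        return None
    if old is None:
        return "added"
    if new is None:
        return "missing"
    changed = _changed_fields(old, new, names, set(ignored))
    return "changed " + " ".join(changed) if changed else None


# Made-up sample rows: the last field is a creation time we want to ignore.
old_row = (1, "a", (10, 20), "2017-01-01")
new_row = (1, "b", (10, 25), "2017-09-14")
schema = ["id", "name", ("pos", ["x", "y"]), "created"]

with_schema = tuple_diff(old_row, new_row, schema, ignored=["created"])
without_schema = tuple_diff(old_row, new_row, ignored=["3"])
```

Because each call looks at a single pair of tuples, nothing beyond that pair has to be held in memory, which is the property the description contrasts with DIFF.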





[jira] [Updated] (DATAFU-119) New UDF - TupleDiff

2017-09-14 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-119:

Attachment: DATAFU-119-2.patch



