[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu

2017-09-12 Thread Russell Jurney (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16163642#comment-16163642
 ] 

Russell Jurney commented on DATAFU-61:
--

I think my testing combined with your testing is enough. Let's ship this thing.

> Add TF-IDF Macro to DataFu
> --
>
>                 Key: DATAFU-61
>                 URL: https://issues.apache.org/jira/browse/DATAFU-61
>             Project: DataFu
>          Issue Type: New Feature
>    Affects Versions: 1.3.0
>            Reporter: Russell Jurney
>              Labels: macro
>         Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch
>
>
> The first macro I would like to add is a Term Frequency, Inverse Document 
> Frequency implementation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (DATAFU-130) Add left outer join macro described in the DataFu guide

2017-09-12 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-130:
---

             Summary: Add left outer join macro described in the DataFu guide
                 Key: DATAFU-130
                 URL: https://issues.apache.org/jira/browse/DATAFU-130
             Project: DataFu
          Issue Type: New Feature
            Reporter: Eyal Allweil


Our 
[guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html] 
describes a macro for conveniently making a three-way left outer join. We can 
add this macro to DataFu to make it even easier to use.

The macro's code is as follows:


{noformat}
DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) returns joined {
  cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY $key3;
  $joined = FOREACH cogrouped GENERATE
    FLATTEN($relation1),
    FLATTEN(EmptyBagToNullFields($relation2)),
    FLATTEN(EmptyBagToNullFields($relation3));
}
{noformat}

(We would obviously want to add a test for this, too.)
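
To make the macro's semantics concrete, here is a rough Python model of what the COGROUP plus FLATTEN combination does. It is only a sketch under assumptions: rows are plain tuples, keys are given as column indexes, and the `width2`/`width3` parameters (hypothetical, not part of the macro) stand in for the schema that EmptyBagToNullFields uses to build a row of nulls. Pig's handling of null keys in COGROUP is not modeled.

```python
from collections import defaultdict
from itertools import product


def left_outer_join3(rel1, key1, rel2, key2, rel3, key3, width2, width3):
    """Model of a three-way left outer join built from a cogroup.

    rel1..rel3 are lists of tuples; key1..key3 are column indexes.
    width2/width3 are the number of fields in rel2/rel3 rows (hypothetical
    helper parameters, used to build the all-null placeholder row).
    """
    # COGROUP: one group per key, holding one bag (list) per relation.
    groups = defaultdict(lambda: ([], [], []))
    for slot, (rel, key) in enumerate(((rel1, key1), (rel2, key2), (rel3, key3))):
        for row in rel:
            groups[row[key]][slot].append(row)

    joined = []
    for bag1, bag2, bag3 in groups.values():
        # EmptyBagToNullFields: an empty bag becomes a bag with one all-null row.
        bag2 = bag2 or [(None,) * width2]
        bag3 = bag3 or [(None,) * width3]
        # FLATTEN of several bags yields their cross product. bag1 is left
        # as-is, so a key missing from rel1 produces no output rows.
        for r1, r2, r3 in product(bag1, bag2, bag3):
            joined.append(r1 + r2 + r3)
    return joined
```

Keys that appear only in the second or third relation drop out because flattening the empty first bag yields nothing, which is what makes this a left outer join rather than a full outer join.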







[jira] [Commented] (DATAFU-128) Add documentation for macros

2017-09-12 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16162936#comment-16162936
 ] 

Eyal Allweil commented on DATAFU-128:
-

Is the documentation for updating the website accurate? It contains references 
to svn, which leads me to think it might no longer be relevant ...

> Add documentation for macros
> 
>
>                 Key: DATAFU-128
>                 URL: https://issues.apache.org/jira/browse/DATAFU-128
>             Project: DataFu
>          Issue Type: Improvement
>            Reporter: Eyal Allweil
>
> Now that it is possible to add Pig macros to DataFu, we should update the 
> documentation to reflect this, provide guidelines, and point would-be 
> contributors to examples.





[jira] [Updated] (DATAFU-129) New macro - dedup

2017-09-12 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-129:

Attachment: DATAFU-129.patch

Macro and test

> New macro - dedup
> -
>
>                 Key: DATAFU-129
>                 URL: https://issues.apache.org/jira/browse/DATAFU-129
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>              Labels: macro
>         Attachments: DATAFU-129.patch
>
>
> Macro used to dedup (de-duplicate) a table, based on a key or keys and an 
> ordering (typically a date updated field).
> One thing to consider - the implementation relies on the 
> ExtremalTupleByNthField UDF in PiggyBank. I've added it to the test 
> dependencies in order for the test to run. While I feel that anyone using Pig 
> typically has PiggyBank in the classpath, this might not be true - do we have 
> an alternative? (maybe adding it to the jarjar?)
> The macro's definition looks as follows:
> DEFINE dedup(relation, row_key, order_field) returns out {
> relation - relation to dedup
> row_key - field(s) for group by
> order_field - the field for ordering (to find the most recent record)





[jira] [Created] (DATAFU-129) New macro - dedup

2017-09-12 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-129:
---

             Summary: New macro - dedup
                 Key: DATAFU-129
                 URL: https://issues.apache.org/jira/browse/DATAFU-129
             Project: DataFu
          Issue Type: New Feature
            Reporter: Eyal Allweil
            Assignee: Eyal Allweil


Macro used to dedup (de-duplicate) a table, based on a key or keys and an 
ordering (typically a date updated field).

One thing to consider - the implementation relies on the 
ExtremalTupleByNthField UDF in PiggyBank. I've added it to the test 
dependencies in order for the test to run. While I feel that anyone using Pig 
typically has PiggyBank in the classpath, this might not be true - do we have 
an alternative? (maybe adding it to the jarjar?)

The macro's definition looks as follows:

DEFINE dedup(relation, row_key, order_field) returns out {

relation - relation to dedup
row_key - field(s) for group by
order_field - the field for ordering (to find the most recent record)
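
As an illustration of the behaviour described above (not the macro's Pig implementation, which relies on ExtremalTupleByNthField), the following Python sketch keeps one row per key, choosing the row with the largest ordering value. Representing rows as dicts and keeping the first row seen on ties are assumptions made for this example only.

```python
def dedup(rows, row_key, order_field):
    """Keep, for each key, the row with the greatest order_field value.

    rows        - list of dicts (assumed row representation)
    row_key     - list of field names used for grouping
    order_field - field used for ordering (e.g. a last-updated date)
    """
    latest = {}
    for row in rows:
        key = tuple(row[field] for field in row_key)
        # Replace the stored row only when this one is strictly newer,
        # so the first row seen wins on ties (an assumption of this sketch).
        if key not in latest or row[order_field] > latest[key][order_field]:
            latest[key] = row
    return list(latest.values())
```

For instance, given two rows with the same `id` and different `updated` dates, only the row with the later date survives.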






[jira] [Created] (DATAFU-128) Add documentation for macros

2017-09-12 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-128:
---

             Summary: Add documentation for macros
                 Key: DATAFU-128
                 URL: https://issues.apache.org/jira/browse/DATAFU-128
             Project: DataFu
          Issue Type: Improvement
            Reporter: Eyal Allweil


Now that it is possible to add Pig macros to DataFu, we should update the 
documentation to reflect this, provide guidelines, and point would-be 
contributors to examples.





[jira] [Updated] (DATAFU-127) New macro - sample by keys

2017-09-12 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-127:

Attachment: DATAFU-127.patch

Patch including new macros and tests

> New macro - sample by keys
> --
>
>                 Key: DATAFU-127
>                 URL: https://issues.apache.org/jira/browse/DATAFU-127
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>              Labels: macro
>         Attachments: DATAFU-127.patch
>
>
> Two macros that return a sample of a larger table based on a list of keys, 
> with the schema of the larger table. One of the macros filters by dates, the 
> other doesn't.
> If there are multiple rows with a key that appears in the key list, all of 
> them will be returned (no deduplication is done). The results are returned 
> ordered by the key field in a single file.
> The implementation uses a replicated join for efficiency, which means the 
> key list must be small enough to fit in memory.
> The first macro's definition looks as follows:
> DEFINE sample_by_keys(table, sample_set, join_key_table, join_key_sample) 
> returns out {
> - table_name  - table name to sample
> - sample_set  - a set of keys
> - join_key_table  - join column name in the table
> - join_key_sample - join column name in the sample





[jira] [Created] (DATAFU-127) New macro - sample by keys

2017-09-12 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-127:
---

             Summary: New macro - sample by keys
                 Key: DATAFU-127
                 URL: https://issues.apache.org/jira/browse/DATAFU-127
             Project: DataFu
          Issue Type: New Feature
            Reporter: Eyal Allweil
            Assignee: Eyal Allweil


Two macros that return a sample of a larger table based on a list of keys, with 
the schema of the larger table. One of the macros filters by dates, the other 
doesn't.

If there are multiple rows with a key that appears in the key list, all of them 
will be returned (no deduplication is done). The results are returned ordered 
by the key field in a single file.

The implementation uses a replicated join for efficiency, which means the 
key list must be small enough to fit in memory.

The first macro's definition looks as follows:

DEFINE sample_by_keys(table, sample_set, join_key_table, join_key_sample) 
returns out {

- table_name      - table name to sample
- sample_set      - a set of keys
- join_key_table  - join column name in the table
- join_key_sample - join column name in the sample
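
The behaviour described above can be sketched in Python as an in-memory hash lookup, which is roughly what a replicated join amounts to: the small key list is held in memory while the large table is streamed past it. This is a model under assumptions, not the macro's code. Rows are represented as dicts, the key list is treated as a set (so a key listed twice does not duplicate output rows), and the date-filtering variant is not covered.

```python
def sample_by_keys(table, sample_set, join_key_table, join_key_sample):
    """Return all rows of `table` whose key appears in `sample_set`.

    table           - list of dicts, the large table (assumed representation)
    sample_set      - list of dicts holding the keys to sample
    join_key_table  - join column name in the table
    join_key_sample - join column name in the sample
    """
    # The small side of the join is loaded fully into memory.
    keys = {row[join_key_sample] for row in sample_set}
    # Stream the large table; every matching row is kept (no deduplication).
    matches = [row for row in table if row[join_key_table] in keys]
    # Results come back ordered by the key field, with the table's schema.
    return sorted(matches, key=lambda row: row[join_key_table])
```

Because only `keys` is held in memory, the size of the key list rather than the size of the table is the limiting factor, which matches the caveat in the description.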




