[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu
[ https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16163642#comment-16163642 ]

Russell Jurney commented on DATAFU-61:
--------------------------------------

I think my testing combined with your testing is enough. Let's ship this thing.

> Add TF-IDF Macro to DataFu
> --------------------------
>
>                 Key: DATAFU-61
>                 URL: https://issues.apache.org/jira/browse/DATAFU-61
>             Project: DataFu
>          Issue Type: New Feature
>    Affects Versions: 1.3.0
>            Reporter: Russell Jurney
>              Labels: macro
>         Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch
>
> The first macro I would like to add is a Term Frequency, Inverse Document
> Frequency implementation.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (DATAFU-130) Add left outer join macro described in the DataFu guide
Eyal Allweil created DATAFU-130:
--------------------------------

             Summary: Add left outer join macro described in the DataFu guide
                 Key: DATAFU-130
                 URL: https://issues.apache.org/jira/browse/DATAFU-130
             Project: DataFu
          Issue Type: New Feature
            Reporter: Eyal Allweil

In our [guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a macro is described for conveniently making a three-way left outer join. We can add this macro to DataFu to make it even easier to use.

The macro's code is as follows:

{noformat}
DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) returns joined {
  cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY $key3;
  $joined = FOREACH cogrouped GENERATE
    FLATTEN($relation1),
    FLATTEN(EmptyBagToNullFields($relation2)),
    FLATTEN(EmptyBagToNullFields($relation3));
}
{noformat}

(We would obviously want to add a test for this, too.)
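
For readers less familiar with the COGROUP idiom, here is a minimal Python sketch of the semantics the macro above appears to implement. It is a model for illustration, not DataFu code. It assumes relations are lists of tuples and keys are column indices; the helper names (`_group`, `_null_row_if_empty`), the explicit `width2`/`width3` arguments and the sample data are all hypothetical. EmptyBagToNullFields is modelled as replacing an empty bag with a single all-null row, and null join keys are not handled.

```python
from collections import defaultdict
from itertools import product


def _group(rows, key_idx):
    """Group rows into bags keyed by the value in column key_idx."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[key_idx]].append(row)
    return bags


def _null_row_if_empty(bag, width):
    """Model of EmptyBagToNullFields: an empty bag becomes one all-None row."""
    return bag if bag else [(None,) * width]


def left_outer_join(rel1, key1, rel2, key2, width2, rel3, key3, width3):
    """Three-way left outer join of rel1 with rel2 and rel3 (illustrative only)."""
    bags1 = _group(rel1, key1)
    bags2 = _group(rel2, key2)
    bags3 = _group(rel3, key3)
    joined = []
    # Only keys present in rel1 survive: flattening an empty rel1 bag yields no rows.
    for key, left_rows in bags1.items():
        right2 = _null_row_if_empty(bags2.get(key, []), width2)
        right3 = _null_row_if_empty(bags3.get(key, []), width3)
        # Flattening several bags in one GENERATE produces their cross product.
        for a, b, c in product(left_rows, right2, right3):
            joined.append(a + b + c)
    return joined


# Hypothetical sample data: (id, value) tuples.
users = [(1, "ann"), (2, "bob")]
emails = [(1, "ann@example.com")]
phones = [(2, "555-0100"), (2, "555-0101")]

result = left_outer_join(users, 0, emails, 0, 2, phones, 0, 2)
for row in result:
    print(row)
```

Every row of the first relation is kept; columns from the second and third relations are filled with nulls where no match exists, and multiple matches multiply the output rows.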
[jira] [Commented] (DATAFU-128) Add documentation for macros
[ https://issues.apache.org/jira/browse/DATAFU-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16162936#comment-16162936 ]

Eyal Allweil commented on DATAFU-128:
-------------------------------------

Is the documentation for updating the website accurate? It contains references to svn, which leads me to think it might no longer be relevant ...

> Add documentation for macros
> ----------------------------
>
>                 Key: DATAFU-128
>                 URL: https://issues.apache.org/jira/browse/DATAFU-128
>             Project: DataFu
>          Issue Type: Improvement
>            Reporter: Eyal Allweil
>
> Now that it is possible to add Pig macros to DataFu, we should update the
> documentation to reflect this, provide guidelines, and point would-be
> contributors to examples.
[jira] [Updated] (DATAFU-129) New macro - dedup
[ https://issues.apache.org/jira/browse/DATAFU-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eyal Allweil updated DATAFU-129:
-------------------------------
    Attachment: DATAFU-129.patch

Macro and test

> New macro - dedup
> -----------------
>
>                 Key: DATAFU-129
>                 URL: https://issues.apache.org/jira/browse/DATAFU-129
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>              Labels: macro
>         Attachments: DATAFU-129.patch
>
> Macro used to dedup (de-duplicate) a table, based on a key or keys and an
> ordering (typically a date updated field).
> One thing to consider - the implementation relies on the
> ExtremalTupleByNthField UDF in PiggyBank. I've added it to the test
> dependencies in order for the test to run. While I feel that anyone using Pig
> typically has PiggyBank in the classpath, this might not be true - do we have
> an alternative? (maybe adding it to the jarjar?)
> The macro's definition looks as follows:
> DEFINE dedup(relation, row_key, order_field) returns out {
> relation - relation to dedup
> row_key - field(s) for group by
> order_field - the field for ordering (to find the most recent record)
[jira] [Created] (DATAFU-129) New macro - dedup
Eyal Allweil created DATAFU-129:
--------------------------------

             Summary: New macro - dedup
                 Key: DATAFU-129
                 URL: https://issues.apache.org/jira/browse/DATAFU-129
             Project: DataFu
          Issue Type: New Feature
            Reporter: Eyal Allweil
            Assignee: Eyal Allweil

Macro used to dedup (de-duplicate) a table, based on a key or keys and an ordering (typically a date updated field).

One thing to consider - the implementation relies on the ExtremalTupleByNthField UDF in PiggyBank. I've added it to the test dependencies in order for the test to run. While I feel that anyone using Pig typically has PiggyBank in the classpath, this might not be true - do we have an alternative? (maybe adding it to the jarjar?)

The macro's definition looks as follows:

DEFINE dedup(relation, row_key, order_field) returns out {

relation - relation to dedup
row_key - field(s) for group by
order_field - the field for ordering (to find the most recent record)
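
The intended behaviour of dedup, as described above, can be sketched in Python as follows. This is only a model of the semantics, not the macro's Pig implementation: it assumes rows are dictionaries, that "most recent" means the largest value of the ordering field, and that ties keep the first row seen (the issue does not say how the UDF breaks ties). The sample records and field names are hypothetical.

```python
def dedup(rows, row_key, order_field):
    """Keep one row per key: the one with the largest order_field value.

    rows        - iterable of dicts (stands in for the Pig relation)
    row_key     - list of field names to group by
    order_field - field name used to pick the most recent row
    """
    latest = {}
    for row in rows:
        key = tuple(row[field] for field in row_key)
        current = latest.get(key)
        # Assumption: a strictly larger value wins, so ties keep the first row seen.
        if current is None or row[order_field] > current[order_field]:
            latest[key] = row
    return list(latest.values())


# Hypothetical sample data; ISO dates compare correctly as strings.
records = [
    {"id": 1, "name": "ann", "updated": "2017-09-01"},
    {"id": 1, "name": "anne", "updated": "2017-09-10"},
    {"id": 2, "name": "bob", "updated": "2017-08-15"},
]

deduped = dedup(records, ["id"], "updated")
for row in deduped:
    print(row)
```

Passing several field names in `row_key` models de-duplication on a composite key.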
[jira] [Created] (DATAFU-128) Add documentation for macros
Eyal Allweil created DATAFU-128:
--------------------------------

             Summary: Add documentation for macros
                 Key: DATAFU-128
                 URL: https://issues.apache.org/jira/browse/DATAFU-128
             Project: DataFu
          Issue Type: Improvement
            Reporter: Eyal Allweil

Now that it is possible to add Pig macros to DataFu, we should update the documentation to reflect this, provide guidelines, and point would-be contributors to examples.
[jira] [Updated] (DATAFU-127) New macro - sample by keys
[ https://issues.apache.org/jira/browse/DATAFU-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eyal Allweil updated DATAFU-127:
-------------------------------
    Attachment: DATAFU-127.patch

Patch including new macros and tests

> New macro - sample by keys
> --------------------------
>
>                 Key: DATAFU-127
>                 URL: https://issues.apache.org/jira/browse/DATAFU-127
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>              Labels: macro
>         Attachments: DATAFU-127.patch
>
> Two macros that return a sample of a larger table based on a list of keys,
> with the schema of the larger table. One of the macros filters by dates, the
> other doesn't.
> If there are multiple rows with a key that appears in the key list, all of
> them will be returned (no deduplication is done). The results are returned
> ordered by the key field in a single file.
> The implementation uses a replicated join for efficiency, which means the
> key list must be small enough to fit in memory.
> The first macro's definition looks as follows:
> DEFINE sample_by_keys(table, sample_set, join_key_table, join_key_sample)
> returns out {
> - table - table name to sample
> - sample_set - a set of keys
> - join_key_table - join column name in the table
> - join_key_sample - join column name in the sample
[jira] [Created] (DATAFU-127) New macro - sample by keys
Eyal Allweil created DATAFU-127:
--------------------------------

             Summary: New macro - sample by keys
                 Key: DATAFU-127
                 URL: https://issues.apache.org/jira/browse/DATAFU-127
             Project: DataFu
          Issue Type: New Feature
            Reporter: Eyal Allweil
            Assignee: Eyal Allweil

Two macros that return a sample of a larger table based on a list of keys, with the schema of the larger table. One of the macros filters by dates, the other doesn't.

If there are multiple rows with a key that appears in the key list, all of them will be returned (no deduplication is done). The results are returned ordered by the key field in a single file.

The implementation uses a replicated join for efficiency, which means the key list must be small enough to fit in memory.

The first macro's definition looks as follows:

DEFINE sample_by_keys(table, sample_set, join_key_table, join_key_sample) returns out {

- table - table name to sample
- sample_set - a set of keys
- join_key_table - join column name in the table
- join_key_sample - join column name in the sample
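
The behaviour described above for the first macro can be modelled in Python roughly as follows. This is a sketch of the semantics, not the macro's Pig code: it assumes rows are dictionaries and that the key list contains distinct keys, and it imitates a replicated join by holding the key list in an in-memory set while scanning the larger table once. The sample data and column names are hypothetical, and the date-filtering variant is not shown.

```python
def sample_by_keys(table, sample_set, join_key_table, join_key_sample):
    """Return all rows of table whose key appears in sample_set, ordered by key.

    table           - iterable of dicts (the larger relation)
    sample_set      - iterable of dicts holding the keys to keep
    join_key_table  - name of the join column in table
    join_key_sample - name of the join column in sample_set
    """
    # The small side is held in memory, as in a replicated join.
    keys = {row[join_key_sample] for row in sample_set}
    # Every matching row is kept; no deduplication is performed.
    matches = [row for row in table if row[join_key_table] in keys]
    # The output keeps the schema of the larger table and is ordered by its key.
    return sorted(matches, key=lambda row: row[join_key_table])


# Hypothetical sample data.
page_views = [
    {"member_id": 3, "page": "home"},
    {"member_id": 1, "page": "jobs"},
    {"member_id": 3, "page": "feed"},
    {"member_id": 2, "page": "home"},
]
wanted = [{"id": 3}, {"id": 1}]

sampled = sample_by_keys(page_views, wanted, "member_id", "id")
for row in sampled:
    print(row)
```

Because only the key set is kept in memory, the larger table can be arbitrarily big; the constraint falls entirely on the size of the key list.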