[ https://issues.apache.org/jira/browse/DATAFU-127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16409458#comment-16409458 ]
Eyal Allweil commented on DATAFU-127:
-------------------------------------

The reason I used (and prefer) _sample_ over _filter_ is that this macro really is meant for making a sample. After all, filtering by keys is basically just an inner join; you don't need a macro for that. Also, the macro preserves all the fields that exist in the original, which would usually be a waste for a plain filter. We use it to prepare samples of large tables for later use in development or CI, where the fields that will be used aren't necessarily known in advance and may change over time.

> New macro - sample by keys
> --------------------------
>
>                 Key: DATAFU-127
>                 URL: https://issues.apache.org/jira/browse/DATAFU-127
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>            Priority: Major
>              Labels: macro
>         Attachments: DATAFU-127.patch
>
>
> Two macros that return a sample of a larger table based on a list of keys,
> with the schema of the larger table. One of the macros filters by dates, the
> other doesn't.
> If there are multiple rows with a key that appears in the key list, all of
> them will be returned (no deduplication is done). The results are returned
> ordered by the key field in a single file.
> The implementation uses a replicated join for efficiency, but this means the
> key list must be small enough to fit in memory.
> The first macro's definition looks as follows:
> DEFINE sample_by_keys(table, sample_set, join_key_table, join_key_sample)
> returns out {
> - table_name - table name to sample
> - sample_set - a set of keys
> - join_key_table - join column name in the table
> - join_key_sample - join column name in the sample

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
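
For readers who want to see the behaviour described in the issue, the following is a minimal sketch in plain Python, not the Pig macro from the patch. It models the stated semantics only: the key list is held in memory (as the small side of a replicated join would be), every matching row is kept with all of its fields, nothing is deduplicated, and the result is ordered by the key field. The function reuses the macro's parameter names for readability; the sample rows and the field names `id`, `day`, `value` and `key` are made up for illustration.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def sample_by_keys(
    table: Iterable[Row],
    sample_set: Iterable[Row],
    join_key_table: str,
    join_key_sample: str,
) -> List[Row]:
    """Illustrative model of the described semantics, not the DataFu code."""
    # The key list is loaded fully into memory, which is why it has to be
    # small. Assumption: it is treated as a set, so repeated keys in the
    # sample do not multiply the matching rows.
    keys = {row[join_key_sample] for row in sample_set}

    # Keep every row whose key is in the list, with all of its original
    # fields. Rows that share a key are all kept (no deduplication).
    matched = [dict(row) for row in table if row[join_key_table] in keys]

    # Return the rows ordered by the key field.
    return sorted(matched, key=lambda row: row[join_key_table])


# Hypothetical data: a "large" table and a short key list.
table = [
    {"id": 3, "day": "2018-03-01", "value": "c"},
    {"id": 1, "day": "2018-03-01", "value": "a"},
    {"id": 2, "day": "2018-03-01", "value": "b"},
    {"id": 1, "day": "2018-03-02", "value": "a2"},
]
sample_set = [{"key": 1}, {"key": 3}]

result = sample_by_keys(table, sample_set, "id", "key")
for row in result:
    print(row)
```

Both rows with `id` 1 come back, which matches the "no deduplication" note in the description, and each row keeps the full schema of the larger table.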