[ https://issues.apache.org/jira/browse/TEZ-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy updated TEZ-1528:
-------------------------------
    Description:
Increasingly, data-sets are partitioned across clusters due to legal or
operational considerations. An example is a 'customer-activity' table with
partitions for the same 'date', but sub-partitions located in clusters
across which raw data cannot be moved/copied due to legal considerations.
It would be nice to have Tez support aggregations across these clusters by
providing native support for cross-cluster 'sub-dags' (think auto transform
of mapper-reducer to mapper-combiner-reducer split across clusters), an 'edge'
with *strict* limits on data-transfer across clusters, etc. Providing such a
primitive would make it relatively easy for Hive, Pig etc. to provide SQL
queries, ETL applications etc. across clusters. Limits on data-transfer are
very important: we should support only the transfer of aggregates; joins
across clusters are an anti-goal.

  was:
Increasingly, data-sets are partitioned across clusters due to legal or
operational considerations. An example is a 'customer-activity' table with
partitions for the same 'date', but sub-partitions located in clusters
across which raw data cannot be moved/copied due to legal considerations.
It would be nice to have Tez support aggregations across these clusters by
providing native support for cross-cluster 'sub-dags' (think auto transform
of mapper-reducer to mapper-combiner-reducer split across clusters), an 'edge'
with *strict* limits on data-transfer across clusters, etc. Providing such a
primitive would make it relatively easy for Hive, Pig etc. to provide SQL
queries, ETL applications etc. across clusters.

> Native support for multi cluster aggregations
> ---------------------------------------------
>
>                 Key: TEZ-1528
>                 URL: https://issues.apache.org/jira/browse/TEZ-1528
>             Project: Apache Tez
>          Issue Type: New Feature
>            Reporter: Arun C Murthy
>
> Increasingly, data-sets are partitioned across clusters due to legal or
> operational considerations. An example is a 'customer-activity' table with
> partitions for the same 'date', but sub-partitions located in clusters
> across which raw data cannot be moved/copied due to legal considerations.
> It would be nice to have Tez support aggregations across these clusters by
> providing native support for cross-cluster 'sub-dags' (think auto transform
> of mapper-reducer to mapper-combiner-reducer split across clusters), an 'edge'
> with *strict* limits on data-transfer across clusters, etc. Providing such a
> primitive would make it relatively easy for Hive, Pig etc. to provide SQL
> queries, ETL applications etc. across clusters. Limits on data-transfer are
> very important: we should support only the transfer of aggregates; joins
> across clusters are an anti-goal.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)