[jira] [Updated] (TEZ-1528) Native support for multi cluster aggregations

Arun C Murthy (JIRA) Sun, 31 Aug 2014 23:55:04 -0700

     [ https://issues.apache.org/jira/browse/TEZ-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Arun C Murthy updated TEZ-1528:
-------------------------------
    Description: 
Increasingly, data-sets are partitioned across clusters due to legal or 
operational considerations. An example is a 'customer-activity' table with 
partitions for the same 'date' whose sub-partitions are located in different 
clusters, across which raw data cannot be moved or copied for legal reasons.

It would be nice to have Tez support aggregations across these clusters by 
providing native support for cross-cluster 'sub-dags' (think of an automatic 
transformation of mapper-reducer into mapper-combiner-reducer, split across 
clusters), an 'edge' with *strict* limits on data-transfer across clusters, 
etc. Providing such a primitive would make it easier for Hive, Pig etc. to 
support SQL queries, ETL applications etc. across clusters. Limits on 
data-transfer are very important: we should support only the transfer of 
aggregates; joins across clusters are an anti-goal.
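
To make the mapper-combiner-reducer split concrete, here is a minimal sketch 
in plain Python. It is not Tez API: the function names, the byte limit and the 
sample rows are all hypothetical and only illustrate the idea that raw rows 
stay in their home cluster while size-limited aggregates cross the boundary.

```python
import json
from collections import Counter

# Hypothetical cap on bytes that may leave a cluster (illustrative only).
MAX_TRANSFER_BYTES = 1024


def local_combine(rows):
    """Runs inside one cluster: collapse raw rows into per-key totals."""
    partial = Counter()
    for customer_id, events in rows:
        partial[customer_id] += events
    return dict(partial)


def cross_cluster_edge(partial, limit=MAX_TRANSFER_BYTES):
    """Serialize aggregates for transfer, refusing payloads over the limit."""
    payload = json.dumps(partial, sort_keys=True).encode("utf-8")
    if len(payload) > limit:
        raise ValueError(
            f"transfer of {len(payload)} bytes exceeds limit of {limit}"
        )
    return payload


def global_reduce(payloads):
    """Runs in the coordinating cluster: merge per-cluster aggregates."""
    total = Counter()
    for payload in payloads:
        total.update(json.loads(payload.decode("utf-8")))
    return dict(total)


# Raw rows never leave their home cluster; only the combined totals do.
cluster_a = [("alice", 3), ("bob", 1), ("alice", 2)]
cluster_b = [("bob", 4), ("carol", 7)]

payloads = [cross_cluster_edge(local_combine(c)) for c in (cluster_a, cluster_b)]
result = global_reduce(payloads)
```

A join would need raw rows from both sides on the same machine, which is 
exactly what the limit in this sketch is meant to rule out.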

  was:
Increasingly, data-sets are partitioned across clusters due to legal or 
operational considerations. An e.g. is a 'customer-activity' table with 
partitions for the same 'date', but sub-partitions located in the clusters 
across which raw data cannot be moved/copied due to legal considerations.

It would be nice to have Tez support aggregations across these clusters by 
providing native support for cross-cluster 'sub-dags' (think auto transform of 
mapper-reducer to mapper-combiner-reducer split across clusters),  'edge' with 
*strict* limits on data-transfer across clusters etc. Providing such a 
primitive would make it relatively easier for Hive, Pig etc. to provide SQL 
queries, ETL applications etc. across clusters.


> Native support for multi cluster aggregations
> ---------------------------------------------
>
>                 Key: TEZ-1528
>                 URL: https://issues.apache.org/jira/browse/TEZ-1528
>             Project: Apache Tez
>          Issue Type: New Feature
>            Reporter: Arun C Murthy
>
> Increasingly, data-sets are partitioned across clusters due to legal or 
> operational considerations. An example is a 'customer-activity' table with 
> partitions for the same 'date' whose sub-partitions are located in different 
> clusters, across which raw data cannot be moved or copied for legal reasons.
>
> It would be nice to have Tez support aggregations across these clusters by 
> providing native support for cross-cluster 'sub-dags' (think of an automatic 
> transformation of mapper-reducer into mapper-combiner-reducer, split across 
> clusters), an 'edge' with *strict* limits on data-transfer across clusters, 
> etc. Providing such a primitive would make it easier for Hive, Pig etc. to 
> support SQL queries, ETL applications etc. across clusters. Limits on 
> data-transfer are very important: we should support only the transfer of 
> aggregates; joins across clusters are an anti-goal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
