Sharding and joins
------------------
Key: PIG-241
URL: https://issues.apache.org/jira/browse/PIG-241
Project: Pig
Issue Type: New Feature
Components: data
Reporter: John DeTreville
Many large distributed systems for storage and computing over tables divide
these tables into smaller _shards,_ such that all rows with the same (primary)
key will appear in the same shard. If two tables are consistently sharded, then
they can be joined shard-by-shard. If corresponding shards are stored on the
same hosts (or racks), then joins can be performed locally on those hosts
without copying the rows of the tables over the network; this can produce
significant speedups.
Pig does not currently provide application-controlled sharding and the
associated shard placement and computation placement. The performance of joins
therefore suffers in many scenarios; rows are passed over the network multiple
times when performing a join. If Pig (and Hadoop) could provide the ability for
the application to shard tables consistently, according to an
application-controlled policy, joins could be completely local operations and
could in many cases perform much better.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.