[
https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688268#action_12688268
]
Klaas Bosteels edited comment on HADOOP-5528 at 3/27/09 4:01 AM:
-----------------------------------------------------------------
The revised patch allows the subarray to be defined by means of Python-style
offsets:
* {{mapred.binary.partitioner.left.offset}}: left Python-style offset in array
* {{mapred.binary.partitioner.right.offset}}: right Python-style offset in array
The best way to remember how these offsets work is by thinking of them as
indices pointing between the array elements, with the left edge of the first
element numbered 0, e.g.:
{code}
 +---+---+---+---+---+
 | B | B | B | B | B |
 +---+---+---+---+---+
 0   1   2   3   4   5
-5  -4  -3  -2  -1
{code}
The first row of numbers gives the position of the offsets 0...5 in the
array; the second row gives the corresponding negative offsets. When _i_ and
_j_ are specified as left and right offset, respectively, then all bytes
between the edges labeled _i_ and _j_ are taken into account for the
partitioning.
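
To make the edge numbering concrete, here is a minimal sketch in plain Java of how such offsets could be resolved. It is not the code from the patch: the class and method names ({{OffsetSketch}}, {{resolve}}, {{slice}}, {{partition}}) are hypothetical, and both the clamping of out-of-range offsets (as Python slices do) and the use of {{Arrays.hashCode}} are assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the BinaryPartitioner from the patch.
class OffsetSketch {

    // Map a Python-style edge offset onto an absolute edge position in [0, length].
    // Negative offsets count back from the right edge of the array.
    // Clamping out-of-range values is an assumption borrowed from Python slicing.
    static int resolve(int offset, int length) {
        int pos = offset < 0 ? length + offset : offset;
        return Math.max(0, Math.min(length, pos));
    }

    // Return the bytes between the edges labeled left and right.
    static byte[] slice(byte[] bytes, int left, int right) {
        int from = resolve(left, bytes.length);
        int to = resolve(right, bytes.length);
        if (to < from) {
            to = from; // crossed edges select nothing (assumption)
        }
        return Arrays.copyOfRange(bytes, from, to);
    }

    // Hash only the selected bytes and map the hash onto a partition number.
    static int partition(byte[] bytes, int left, int right, int numPartitions) {
        int hash = Arrays.hashCode(slice(bytes, left, right));
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }
}
```

With the five-byte array from the diagram, a left offset of 1 and a right offset of -1 resolve to edges 1 and 4, so only the three middle bytes contribute to the hash.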
More generally, the indexing logic can now be customized by specifying the
{{BinaryPartitioner.Indexer}} classes to be used via the following properties:
* {{mapred.binary.partitioner.left.indexer.class}}
* {{mapred.binary.partitioner.right.indexer.class}}
By default, {{FirstIndexer}} and {{LastIndexer}} are used, so the whole byte
array is taken into account for the hashing. Setting one of the offset
properties switches the corresponding side to {{PosOffsetIndexer}} and/or
{{NegOffsetIndexer}}, which implement the indexing by means of Python-style
offsets.
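
The selection rule described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the logic in the patch: {{IndexerChoiceSketch}} and {{choose}} are hypothetical names, {{java.util.Properties}} stands in for the job configuration, and two behaviours are guesses on my part, namely that an explicitly configured indexer class takes precedence over an offset property, and that the sign of the offset decides between {{PosOffsetIndexer}} and {{NegOffsetIndexer}}.

```java
import java.util.Properties;

// Hypothetical illustration only; the real selection logic lives in the patch.
class IndexerChoiceSketch {

    static final String PREFIX = "mapred.binary.partitioner.";

    // side is "left" or "right"; returns the simple name of the indexer to use.
    static String choose(Properties conf, String side) {
        // Assumption: an explicitly configured indexer class wins.
        String explicit = conf.getProperty(PREFIX + side + ".indexer.class");
        if (explicit != null) {
            return explicit;
        }
        // No offset configured: fall back to the defaults named in the comment.
        String offset = conf.getProperty(PREFIX + side + ".offset");
        if (offset == null) {
            return side.equals("left") ? "FirstIndexer" : "LastIndexer";
        }
        // Assumption: the sign of the offset picks the offset indexer.
        return Integer.parseInt(offset.trim()) < 0
                ? "NegOffsetIndexer"
                : "PosOffsetIndexer";
    }
}
```

Under these assumptions, a job that only sets {{mapred.binary.partitioner.right.offset}} to -2 would keep {{FirstIndexer}} on the left and get {{NegOffsetIndexer}} on the right.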
was (Author: klbostee):
The revised patch allows the subarray to be defined by means of
Python-style offsets:
* {{mapred.binary.partitioner.left.offset}}: left Python-style offset in array
* {{mapred.binary.partitioner.right.offset}}: right Python-style offset in array
As indicated by Owen, the best way to remember how these offsets work is by
thinking of them as indices pointing between the array elements, with the left
edge of the first element numbered 0, e.g.:
{code}
 +---+---+---+---+---+
 | B | B | B | B | B |
 +---+---+---+---+---+
 0   1   2   3   4   5
-5  -4  -3  -2  -1
{code}
The first row of numbers gives the position of the offsets 0...5 in the
array; the second row gives the corresponding negative offsets. When _i_ and
_j_ are specified as left and right offset, respectively, then all bytes
between the edges labeled _i_ and _j_ are taken into account for the
partitioning.
More generally, the indexing logic can now be customized by specifying the
{{BinaryPartitioner.Indexer}} classes to be used via the following properties:
* {{mapred.binary.partitioner.left.indexer.class}}
* {{mapred.binary.partitioner.right.indexer.class}}
By default, {{FirstIndexer}} and {{LastIndexer}} are used (i.e. the whole byte
array is taken into account for the hashing), and the offset properties trigger
the usage of {{PosOffsetIndexer}} and/or {{NegOffsetIndexer}}, which implement
the indexing by means of Python-style offsets.
> Binary partitioner
> ------------------
>
> Key: HADOOP-5528
> URL: https://issues.apache.org/jira/browse/HADOOP-5528
> Project: Hadoop Core
> Issue Type: New Feature
> Components: mapred
> Reporter: Klaas Bosteels
> Assignee: Klaas Bosteels
> Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch,
> HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions
> {{BinaryComparable}} keys by hashing a configurable part of the bytes array
> corresponding to each key.