[ 
https://issues.apache.org/jira/browse/HADOOP-5528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688268#action_12688268
 ] 

Klaas Bosteels edited comment on HADOOP-5528 at 3/27/09 4:01 AM:
-----------------------------------------------------------------

The revised patch allows the subarray to be defined by means of Python-style 
offsets:

* {{mapred.binary.partitioner.left.offset}}: left Python-style offset in array
* {{mapred.binary.partitioner.right.offset}}: right Python-style offset in array

The best way to remember how these offsets work is by thinking of them as 
indices pointing between the array elements, with the left edge of the first 
element numbered 0, e.g.:

{code}
  +---+---+---+---+---+
  | B | B | B | B | B |
  +---+---+---+---+---+
  0   1   2   3   4   5
 -5  -4  -3  -2  -1
{code}

The first row of numbers gives the positions of the offsets 0...5 in the 
array; the second row gives the corresponding negative offsets. When _i_ and 
_j_ are specified as the left and right offsets, respectively, all bytes 
between the edges labeled _i_ and _j_ are taken into account for the 
partitioning. For example, a left offset of 1 and a right offset of -1 select 
the three middle bytes of the array above.
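
The offset arithmetic described above can be sketched roughly as follows. This 
is only an illustration and not the code from the patch: the class name 
{{OffsetSketch}}, the helper names, the simple 31-based hash, and the clamping 
of out-of-range offsets (borrowed from how Python slices behave) are all 
assumptions made for the example.

```java
public class OffsetSketch {
    // Map a Python-style offset (possibly negative) to an edge position
    // in [0, length]. Clamping is an assumption, mirroring Python slices.
    static int resolve(int offset, int length) {
        int pos = offset < 0 ? offset + length : offset;
        return Math.max(0, Math.min(length, pos));
    }

    // Hash only the bytes between the edges labeled left and right.
    // The hash function itself is a placeholder, not the patch's.
    static int hashRange(byte[] bytes, int left, int right) {
        int from = resolve(left, bytes.length);
        int to = Math.max(from, resolve(right, bytes.length));
        int hash = 1;
        for (int i = from; i < to; i++) {
            hash = 31 * hash + bytes[i];
        }
        return hash;
    }

    // Turn the hash into a partition number; the mask keeps it non-negative.
    static int partition(byte[] key, int left, int right, int numPartitions) {
        return (hashRange(key, left, right) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Two keys that differ only in their first and last byte.
        byte[] a = {9, 1, 2, 3, 7};
        byte[] b = {5, 1, 2, 3, 8};
        // With offsets 1 and -1 only the middle bytes {1, 2, 3} are hashed,
        // so both keys land in the same partition.
        System.out.println(partition(a, 1, -1, 4) == partition(b, 1, -1, 4));
    }
}
```

In this sketch the offsets -5 and 0 resolve to the same edge, as do -1 and 4, 
which matches the two rows of numbers in the diagram.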
 
More generally, the indexing logic can now be customized by specifying the 
{{BinaryPartitioner.Indexer}} classes to be used via the following properties:

* {{mapred.binary.partitioner.left.indexer.class}}
* {{mapred.binary.partitioner.right.indexer.class}}

By default, {{FirstIndexer}} and {{LastIndexer}} are used (i.e. the whole byte 
array is taken into account for the hashing). Setting the offset properties 
triggers the use of {{PosOffsetIndexer}} and/or {{NegOffsetIndexer}} instead, 
which implement the indexing by means of Python-style offsets.
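
To make the relationship between the four indexers concrete, here is a minimal 
sketch of how such a selection could look. The real 
{{BinaryPartitioner.Indexer}} signature is not shown in this comment, so the 
single-method interface below, the {{choose}} helper, and the class name 
{{IndexerSketch}} are hypothetical and only mirror the behaviour described 
above.

```java
public class IndexerSketch {
    // Hypothetical shape of an indexer: given the key length, return an
    // edge position in [0, length]. The actual interface may differ.
    interface Indexer {
        int index(int length);
    }

    // Defaults: left edge of the first byte and right edge of the last byte,
    // which together cover the whole array.
    static final Indexer FIRST = length -> 0;
    static final Indexer LAST = length -> length;

    // Non-negative offsets count from the left edge of the array.
    static Indexer posOffset(int offset) {
        return length -> Math.min(offset, length);
    }

    // Negative offsets count back from the right edge of the array.
    static Indexer negOffset(int offset) {
        return length -> Math.max(0, length + offset);
    }

    // Use the default indexer unless an offset was configured (null = unset).
    static Indexer choose(Integer offset, Indexer defaultIndexer) {
        if (offset == null) {
            return defaultIndexer;
        }
        return offset >= 0 ? posOffset(offset) : negOffset(offset);
    }

    public static void main(String[] args) {
        int length = 5;
        Indexer left = choose(null, FIRST); // no left offset configured
        Indexer right = choose(-1, LAST);   // right offset of -1 configured
        System.out.println(left.index(length) + ".." + right.index(length));
    }
}
```

With a five-byte key, the example resolves the left edge to 0 and the right 
edge to 4, so the last byte is left out of the hash.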

> Binary partitioner
> ------------------
>
>                 Key: HADOOP-5528
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5528
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Klaas Bosteels
>            Assignee: Klaas Bosteels
>         Attachments: HADOOP-5528.patch, HADOOP-5528.patch, HADOOP-5528.patch, 
> HADOOP-5528.patch, HADOOP-5528.patch
>
>
> It would be useful to have a {{BinaryPartitioner}} that partitions 
> {{BinaryComparable}} keys by hashing a configurable part of the bytes array 
> corresponding to each key.
