[
https://issues.apache.org/jira/browse/PIG-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ying He updated PIG-480:
------------------------
Attachment: PIG_480.patch
Patch to use the identity map.
An IdentityMapOptimizer is applied when an MR plan contains at least 2 MR jobs.
It evaluates each MR job: if its reducer uses POStore to dump a tmp file, and
the mapper of the next MR job contains only a POLoad to load that tmp file and a
POLocalRearrange, then the POLocalRearrange of the next mapper is moved up to
the reducer of this MR job, and the mapper of the next MR job is changed to use
the identity map.
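The rewrite rule can be sketched on a toy model of the plan. This is only an illustration in Python, not the code in the patch: the MRJob class, the apply_identity_map_rule function, and the operator lists are hypothetical, and the sketch does not check that the stored tmp file is the one the next job loads.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MRJob:
    """Hypothetical stand-in for one MR job: operator names per phase."""
    map_ops: List[str] = field(default_factory=list)
    reduce_ops: List[str] = field(default_factory=list)
    identity_map: bool = False


def apply_identity_map_rule(jobs: List[MRJob]) -> List[MRJob]:
    """Apply the rewrite to each adjacent (this, next) pair of jobs, in place."""
    if len(jobs) < 2:
        return jobs  # the rule only applies to plans with at least 2 MR jobs
    for this_job, next_job in zip(jobs, jobs[1:]):
        stores_tmp = bool(this_job.reduce_ops) and this_job.reduce_ops[-1] == "POStore"
        trivial_map = next_job.map_ops == ["POLoad", "POLocalRearrange"]
        if stores_tmp and trivial_map:
            # Move POLocalRearrange up, just before the POStore in the reducer.
            this_job.reduce_ops.insert(len(this_job.reduce_ops) - 1, "POLocalRearrange")
            # Assumption of this toy model: an identity mapper has no operators.
            next_job.map_ops = []
            next_job.identity_map = True
    return jobs


# Illustrative shape of a join followed by a group; operator lists are assumed.
join_job = MRJob(map_ops=["POLoad", "POLocalRearrange"],
                 reduce_ops=["POPackage", "POForEach", "POStore"])
group_job = MRJob(map_ops=["POLoad", "POLocalRearrange"],
                  reduce_ops=["POPackage", "POStore"])
plan = apply_identity_map_rule([join_job, group_job])
```

After the call, the first job's reducer ends with POLocalRearrange followed by POStore, and the second job is flagged as using the identity map.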
In this case, the reducer of the MR job outputs (key, tuple) pairs to the tmp
file using a different OutputFormat, PigBinaryValueOutputFormat. It uses a
different record writer to dump the data. The format is:
delimiter (3 bytes: 0x01, 0x02, 0x03)
key
length of byte[] for tuple
byte[] for tuple
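A minimal sketch of this record layout, written in Python for illustration. The description does not say how the key or the length field are encoded, so the sketch assumes a length-prefixed key and 4-byte big-endian lengths; write_record and read_record are hypothetical names, not methods of the patch.

```python
import io
import struct

DELIMITER = bytes([0x01, 0x02, 0x03])  # the 3-byte record delimiter


def write_record(out, key: bytes, tuple_bytes: bytes) -> None:
    """Write one (key, tuple) record; the tuple is already serialized."""
    out.write(DELIMITER)
    # Assumption: the key is written as a length-prefixed byte string.
    out.write(struct.pack(">i", len(key)))
    out.write(key)
    # Length of the serialized tuple, then the bytes themselves.
    out.write(struct.pack(">i", len(tuple_bytes)))
    out.write(tuple_bytes)


def read_record(inp):
    """Read one record, leaving the tuple as raw bytes; None at end of stream."""
    marker = inp.read(3)
    if not marker:
        return None
    if marker != DELIMITER:
        raise ValueError("missing record delimiter")
    (key_len,) = struct.unpack(">i", inp.read(4))
    key = inp.read(key_len)
    (tuple_len,) = struct.unpack(">i", inp.read(4))
    tuple_bytes = inp.read(tuple_len)
    return key, tuple_bytes


buf = io.BytesIO()
write_record(buf, b"id-1", b"\x00serialized-tuple\x01")
write_record(buf, b"id-2", b"")
buf.seek(0)

records = []
while True:
    rec = read_record(buf)
    if rec is None:
        break
    records.append(rec)
```

The reader never interprets the tuple bytes, which is what lets a downstream mapper forward them untouched.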
The next MR job, which uses the identity map, uses a different InputFormat,
PigBinaryValueInputFormat, which returns a different RecordReader to read the
data in as (key, tuple) pairs, with the tuple kept in byte[] form. The identity
map does nothing except pass the (key, tuple) pairs through and write them to
disk. When the reducer picks them up, the tuple is deserialized for processing.
The reason for doing this is performance. Because tuples are read in and
written out by the identity map in byte[] form, we save one deserialization and
one serialization of each tuple in the mapper.
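The saving can be made concrete by counting calls in a small simulation. This is a sketch only: pickle stands in for Pig's tuple serialization, and regular_map, identity_map, and reduce_side are hypothetical functions, not Pig classes.

```python
import pickle

calls = {"deserialize": 0, "serialize": 0}


def deserialize(raw: bytes):
    calls["deserialize"] += 1
    return pickle.loads(raw)


def serialize(value) -> bytes:
    calls["serialize"] += 1
    return pickle.dumps(value)


def regular_map(key, raw_tuple):
    # A regular mapper materializes the tuple and then writes it out again.
    return key, serialize(deserialize(raw_tuple))


def identity_map(key, raw_tuple):
    # The identity map forwards the bytes without looking inside them.
    return key, raw_tuple


def reduce_side(raw_tuple):
    # The tuple is only deserialized once it reaches the reducer.
    return deserialize(raw_tuple)


records = [(i % 2, pickle.dumps((i, "v%d" % i))) for i in range(4)]

identity_out = [identity_map(k, raw) for k, raw in records]
identity_calls = dict(calls)   # counters after the identity map pass

regular_out = [regular_map(k, raw) for k, raw in records]
regular_calls = dict(calls)    # counters after the regular map pass

tuples = [reduce_side(raw) for _, raw in identity_out]
```

For the four records here, the identity pass leaves both counters at zero, while the regular pass performs one deserialization and one serialization per record.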
An example use case follows:
a = load 'f' as (id, v);
b = load 's' as (id, v);
c = join a by id, b by id;
d = group c by a::id;
dump d;
This example contains 2 MR jobs. After optimization, the first job outputs
(key, tuple) pairs, and the second job uses the identity map.
> PERFORMANCE: Use identity mapper in a chain of M-R jobs
> -------------------------------------------------------
>
> Key: PIG-480
> URL: https://issues.apache.org/jira/browse/PIG-480
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.2.0
> Reporter: Olga Natkovich
> Attachments: PIG_480.patch
>
>
> For jobs with two or more MR jobs, use identity mapper wherever possible in
> second and subsequent MR jobs. Identity mapper is about 50% faster than pig
> empty map job because it doesn't parse the data.