Rohini Palaniswamy created PIG-4652:
---------------------------------------
Summary: [Pig on Tez] Group by on multiple keys is slower than
mapreduce
Key: PIG-4652
URL: https://issues.apache.org/jira/browse/PIG-4652
Project: Pig
Issue Type: Bug
Reporter: Rohini Palaniswamy
Fix For: 0.16.0
Tez uses PigTupleSortComparator on both the map and reduce sides and in
POShuffleTezLoad. Mapreduce uses PigTupleWritableComparator on the map and
reduce sides for comparing tuples, which is a byte-only comparison and very
fast. It then uses PigGrouping<DataType>WritableComparator as the grouping
comparator to correctly group those keys.
It is not possible to use a similar method in Tez (PigTupleWritableComparator
for output and input, and PigTupleSortComparator in POShuffleTezLoad) without
the addition of APIs in Tez to get the raw bytes of the keys. When we compare
multiple inputs for the min key in POShuffleTezLoad, their raw bytes need to
be compared to maintain the same order as on the map side. In mapreduce, there
was only a single input and the mapreduce framework sorted all records
together. But in Tez, the join inputs are sorted separately and the
application only gets the deserialized key, not its raw bytes. We need APIs in
Tez KeyValuesReader to also get the bytes of the current key, which can then
be used in POShuffleTezLoad for the min key comparison.
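To illustrate why the min key comparison has to use the same ordering as the
map-side sort, here is a minimal plain-Java sketch. It is not Tez or Pig code:
the class MinKeyMerge and its methods are hypothetical, and raw keys are
modelled as byte[] because, as noted above, KeyValuesReader does not currently
expose them. Each input is assumed to be already sorted by unsigned byte order.

```java
import java.util.ArrayList;
import java.util.List;

public class MinKeyMerge {

    /** Unsigned lexicographic byte comparison; stands in for the map-side raw sort order. */
    public static int compareRaw(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    /**
     * Merges inputs that are each already sorted by compareRaw by repeatedly
     * taking the smallest head key, the way a multi-input reader would pick
     * its next "min key".
     */
    public static List<byte[]> merge(List<List<byte[]>> inputs) {
        int[] pos = new int[inputs.size()];
        List<byte[]> out = new ArrayList<>();
        while (true) {
            int min = -1;
            for (int i = 0; i < inputs.size(); i++) {
                if (pos[i] >= inputs.get(i).size()) {
                    continue; // this input is exhausted
                }
                if (min < 0 || compareRaw(inputs.get(i).get(pos[i]),
                                          inputs.get(min).get(pos[min])) < 0) {
                    min = i;
                }
            }
            if (min < 0) {
                return out; // all inputs exhausted
            }
            out.add(inputs.get(min).get(pos[min]++));
        }
    }
}
```

If the merge step used a different ordering than the one the inputs were
sorted with, equal keys from different inputs could fail to line up, which is
why the raw bytes are needed on the reduce side.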
But the majority of the slowness of PigTupleSortComparator seems to come from
the inefficiency of String comparison in BinInterSedesTupleRawComparator, which
constructs String objects instead of comparing bytes like Text.Comparator.
{code}
str1 = new String(bb1.array(), bb1.position(), casz1, BinInterSedes.UTF8);
str2 = new String(bb2.array(), bb2.position(), casz2, BinInterSedes.UTF8);
{code}
Fixing that should bring performance very close to mapreduce, with negligible
difference. But following a mapreduce-like model should make it even more
efficient.
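A rough sketch of what comparing the bytes directly could look like, instead
of decoding both sides into String first. This is not the actual patch: the
class Utf8ByteCompare and the method compareCharArrays are hypothetical names,
and the parameters mirror the bb1/bb2/casz1/casz2 variables from the snippet
above. One caveat worth checking before adopting this approach: unsigned byte
order on UTF-8 is code point order, while String.compareTo uses UTF-16 code
unit order, so the two can disagree when supplementary characters are compared
against characters in the U+E000 to U+FFFF range.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class Utf8ByteCompare {

    /** Lexicographic comparison of two byte ranges, treating each byte as unsigned. */
    public static int compareBytes(byte[] b1, int s1, int l1,
                                   byte[] b2, int s2, int l2) {
        int n = Math.min(l1, l2);
        for (int i = 0; i < n; i++) {
            int a = b1[s1 + i] & 0xff;
            int b = b2[s2 + i] & 0xff;
            if (a != b) {
                return a - b;
            }
        }
        return l1 - l2;
    }

    /**
     * Hypothetical stand-in for the string branch of the raw comparator:
     * compares casz1/casz2 bytes starting at each buffer's current position
     * without allocating any String. arrayOffset() is added so that sliced
     * buffers are handled as well.
     */
    public static int compareCharArrays(ByteBuffer bb1, int casz1,
                                        ByteBuffer bb2, int casz2) {
        return compareBytes(
                bb1.array(), bb1.arrayOffset() + bb1.position(), casz1,
                bb2.array(), bb2.arrayOffset() + bb2.position(), casz2);
    }

    public static void main(String[] args) {
        byte[] a = "apple".getBytes(StandardCharsets.UTF_8);
        byte[] b = "banana".getBytes(StandardCharsets.UTF_8);
        int cmp = compareCharArrays(ByteBuffer.wrap(a), a.length,
                                    ByteBuffer.wrap(b), b.length);
        System.out.println(Integer.signum(cmp));
    }
}
```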