[ https://issues.apache.org/jira/browse/HIVE-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zheng Shao updated HIVE-625:
----------------------------

    Attachment: HIVE-625.1.patch

An extreme test case shows a big performance improvement.

{code}
select CAST(rand() * 1024 * 1024 AS INT) as a, rand() as b from mytable cluster by a limit 10;
{code}

The key is an int, and the value is a double.

I ran this on an example table. The mappers of the new code take 98 seconds on average. The mappers of the old code (without this patch) take 165 seconds on average, so the patch cuts mapper time by roughly 40%.

Although this is an extreme example, it does show the large improvement from using the binary serialization format.

Note that the test was done with gzip as mapred.map.output.compression.codec, so the time difference is exaggerated a bit (compared with the same test run with Lzo).

> Use of BinarySortableSerDe for serialization of the value between map and
> reduce boundary
> -----------------------------------------------------------------------------------------
>
>                 Key: HIVE-625
>                 URL: https://issues.apache.org/jira/browse/HIVE-625
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>         Attachments: HIVE-625.1.patch
>
>
> We currently use LazySimpleSerDe, which serializes double to text format.
> Until we have LazyBinarySerDe, we should switch to BinarySortableSerDe
> because that's still much faster than LazySimpleSerDe.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
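
For readers who want to see why a binary format helps with the double value in the test query, here is a minimal Java sketch. It is not Hive's SerDe code: the class DoubleEncodingSketch and all of its methods are hypothetical names made up for illustration. It compares the decimal text form of a double with a fixed 8-byte encoding whose unsigned byte order matches numeric order (sign bit flipped for non-negative values, all bits flipped for negative ones), which is the general idea behind a sortable binary format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not taken from the Hive code base.
class DoubleEncodingSketch {

    // Text encoding: the decimal string of the value, as a text-based
    // format would write it. Length varies with the number of digits.
    static byte[] toTextBytes(double v) {
        return Double.toString(v).getBytes(StandardCharsets.UTF_8);
    }

    // Order-preserving binary encoding: always 8 big-endian bytes.
    static byte[] toSortableBytes(double v) {
        long bits = Double.doubleToLongBits(v);
        if (bits < 0) {
            bits = ~bits;            // negative value: flip every bit
        } else {
            bits ^= Long.MIN_VALUE;  // non-negative value: flip the sign bit
        }
        return ByteBuffer.allocate(8).putLong(bits).array();
    }

    // Inverse of toSortableBytes.
    static double fromSortableBytes(byte[] b) {
        long bits = ByteBuffer.wrap(b).getLong();
        if (bits < 0) {
            bits ^= Long.MIN_VALUE;  // top bit set: value was non-negative
        } else {
            bits = ~bits;            // top bit clear: value was negative
        }
        return Double.longBitsToDouble(bits);
    }

    // Lexicographic comparison of two byte arrays, treating bytes as unsigned.
    static int compareUnsigned(byte[] x, byte[] y) {
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(x[i] & 0xff, y[i] & 0xff);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(x.length, y.length);
    }

    public static void main(String[] args) {
        double v = Math.random();    // same kind of value as rand() in the query
        System.out.println("text bytes:   " + toTextBytes(v).length);
        System.out.println("binary bytes: " + toSortableBytes(v).length);
    }
}
```

A random double usually prints with many digits, so its text form tends to be noticeably longer than 8 bytes, and writing or reading it also involves formatting and parsing the decimal string. The fixed-width form avoids both costs, which is consistent with the mapper times reported above, though this sketch does not measure them.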