[ 
https://issues.apache.org/jira/browse/HIVE-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated HIVE-625:
----------------------------

    Attachment: HIVE-625.1.patch

An extreme test case shows a big performance improvement.

{code}
  SELECT CAST(rand() * 1024 * 1024 AS INT) AS a, rand() AS b
  FROM mytable
  CLUSTER BY a
  LIMIT 10;
{code}

The key is an int, and the value is a double. I ran this on an example table.

The mappers with the new code take 98 seconds on average.
The mappers with the old code (without this patch) take 165 seconds on average.

Although this is an extreme example, it does show the large improvement from 
using the binary serialization format.
Note that the test was run with gzip as mapred.map.output.compression.codec, 
so the time difference is somewhat exaggerated compared with what we would see 
using LZO.
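
To illustrate why a binary format can help for double values, here is a minimal Java sketch that compares a decimal-text encoding with a fixed 8-byte binary encoding. It is not the code from the patch and does not call any Hive SerDe class: the class name DoubleEncodingSketch and its methods are hypothetical, and the order-preserving byte layout that a sortable format would need is deliberately left out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Hive or of HIVE-625.1.patch.
public class DoubleEncodingSketch {

    // Text encoding: the decimal string of the value as UTF-8 bytes.
    // Length varies with the number of digits and needs parsing on read.
    static byte[] encodeAsText(double v) {
        return Double.toString(v).getBytes(StandardCharsets.UTF_8);
    }

    static double decodeFromText(byte[] bytes) {
        return Double.parseDouble(new String(bytes, StandardCharsets.UTF_8));
    }

    // Binary encoding: the raw IEEE 754 bits, always 8 bytes (big-endian).
    // No number formatting or parsing is involved.
    static byte[] encodeAsBinary(double v) {
        return ByteBuffer.allocate(8).putDouble(v).array();
    }

    static double decodeFromBinary(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getDouble();
    }

    public static void main(String[] args) {
        // Similar in spirit to the rand() values produced by the test query.
        double v = Math.random();

        byte[] text = encodeAsText(v);
        byte[] binary = encodeAsBinary(v);

        System.out.println("value:        " + v);
        System.out.println("text bytes:   " + text.length);
        System.out.println("binary bytes: " + binary.length);

        // Both encodings round-trip the value exactly.
        System.out.println("text ok:      " + (decodeFromText(text) == v));
        System.out.println("binary ok:    " + (decodeFromBinary(binary) == v));
    }
}
```

For random doubles the text form is typically around twice the size of the binary form, and it also pays for formatting on write and parsing on read, which is consistent with the direction of the mapper timings above. The sketch does not measure those timings.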

> Use of BinarySortableSerDe for serialization of the value between map and 
> reduce boundary
> -----------------------------------------------------------------------------------------
>
>                 Key: HIVE-625
>                 URL: https://issues.apache.org/jira/browse/HIVE-625
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Query Processor
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>         Attachments: HIVE-625.1.patch
>
>
> We currently use LazySimpleSerDe, which serializes doubles to text format. 
> Until we have LazyBinarySerDe, we should switch to BinarySortableSerDe 
> because it is still much faster than LazySimpleSerDe.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
