[
https://issues.apache.org/jira/browse/CRUNCH-368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13944443#comment-13944443
]
Chao Shi commented on CRUNCH-368:
---------------------------------
This patch adds one byte (most likely) or more per field to store the field's
size. The downside is that this may increase the number of spills. Judging from
the numbers in the benchmark above, the trade-off looks like a win even in the
case where the tuple is small. The improvement should be larger when the tuple
is larger and not all fields are actually needed for comparisons.
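
To illustrate the idea only, here is a minimal sketch of a size-prefixed layout
and a raw-byte comparison that skips over fields it does not need. It is not the
code in crunch-368.patch: the class name, the TYPE_STRING code, and the simple
7-bit var-int encoding are all assumptions made for the example, and the real
TupleWritable format and Hadoop's own var-int encoding may differ.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: each field is stored as [type code][var-int size][payload].
final class SizePrefixedTupleSketch {

    // Assumed type code for a UTF-8 string field (not Crunch's real code table).
    static final int TYPE_STRING = 1;

    // Writes a non-negative int using 7 bits per byte, low bits first.
    // Sizes below 128 take a single byte, the "most likely" case.
    static void writeVarInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Serializes string fields as: type code, var-int size, payload bytes.
    static byte[] encode(String... fields) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String field : fields) {
            byte[] payload = field.getBytes(StandardCharsets.UTF_8);
            out.write(TYPE_STRING);
            writeVarInt(out, payload.length);
            out.write(payload, 0, payload.length);
        }
        return out.toByteArray();
    }

    // Returns {payloadStart, payloadLength} for the field at the given index.
    // Earlier fields are skipped using only their size prefix; their payloads
    // are never read or deserialized.
    static int[] locate(byte[] buf, int index) {
        int pos = 0;
        for (int i = 0; ; i++) {
            pos++; // skip the type code
            int size = 0;
            int shift = 0;
            int b;
            do {
                b = buf[pos++] & 0xFF;
                size |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            if (i == index) {
                return new int[] {pos, size};
            }
            pos += size; // jump over the payload
        }
    }

    // Compares one field of two serialized tuples directly on the raw bytes,
    // using unsigned lexicographic order.
    static int compareField(byte[] a, byte[] b, int index) {
        int[] ra = locate(a, index);
        int[] rb = locate(b, index);
        int n = Math.min(ra[1], rb[1]);
        for (int i = 0; i < n; i++) {
            int x = a[ra[0] + i] & 0xFF;
            int y = b[rb[0] + i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return ra[1] - rb[1];
    }
}
```

In this sketch, comparing on the second field of two tuples only decodes the
size prefix of the first field and then jumps past its payload, which is where
the saving for large tuples would come from.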
> TupleWritable.Comparator
> ------------------------
>
> Key: CRUNCH-368
> URL: https://issues.apache.org/jira/browse/CRUNCH-368
> Project: Crunch
> Issue Type: Improvement
> Components: Core
> Affects Versions: 0.10.0, 0.8.3
> Reporter: Chao Shi
> Assignee: Chao Shi
> Attachments: crunch-368 benchmark.pdf, crunch-368.patch, gen_data.py
>
>
> This patch should improve comparison performance on TupleWritables. It saves
> the deserialization overhead. It is particularly useful when the input tuples
> are large, e.g. when they contain long strings.
> Please note that this changes the binary format of TupleWritable: it adds a
> var-int indicating the size of each field after its type code. This is needed
> because of a limitation of the Writable system: we do not know the size of a
> field until it has been fully deserialized.