[ 
https://issues.apache.org/jira/browse/HADOOP-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Douglas updated HADOOP-2406:
----------------------------------

    Status: Patch Available  (was: Open)

> Micro-benchmark to measure read/write times through InputFormats
> ----------------------------------------------------------------
>
>                 Key: HADOOP-2406
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2406
>             Project: Hadoop
>          Issue Type: Test
>          Components: fs, test
>            Reporter: Chris Douglas
>            Assignee: Chris Douglas
>             Fix For: 0.16.0
>
>         Attachments: 2406-0.patch, 2406-1.patch
>
>
> The attached test writes XGB to, and reads it back from, the default 
> filesystem through SequenceFileInputFormat and TextInputFormat. Each format 
> is run with LzoCodec, with GzipCodec, and without compression; SequenceFiles 
> are run with both block and record compression.
> The following results, from 10GB of data through RawLocalFileSystem with 
> 5-word keys and 20-word values (as generated by RandomTextWriter with the 
> same seed for each file), are pretty stable:
> Writes:
> || Format || Compression || Type || Time (sec) || Filesize (bytes) ||
> | SEQ | LZO | BLOCK | 318 | 8 604 288 397 |
> | SEQ | LZO | RECORD | 367 | 11 689 969 413 |
> | SEQ | ZIP | BLOCK | 929 | 2 827 697 769 |
> | SEQ | ZIP | RECORD | 1737 | 9 324 730 365 |
> | SEQ |  |  | 201 | 11 282 745 683 |
> | TXT | LZO |  | 742 | 12 671 065 769 |
> | TXT | ZIP |  | 1320 | 2 597 397 680 |
> | TXT |  |  | 392 | 10 818 058 643 |
> Reads:
> || Format || Compression || Type || Time (sec) ||
> | SEQ | LZO | BLOCK | 150 |
> | SEQ | LZO | RECORD | 281 |
> | SEQ | ZIP | BLOCK | 155 |
> | SEQ | ZIP | RECORD | 548 |
> | SEQ |  |  | 209 |
> | TXT | LZO |  | 620 |
> | TXT | ZIP |  | 355 |
> | TXT |  |  | 284 |
> Of note:
> - Lzo-compressed TextOutput is larger than the uncompressed output 
> (HADOOP-2402), and lzop cannot read it.
> - Zip (GzipCodec) compression is expensive. Short values are responsible for 
> the unimpressive compression of record-compressed SequenceFiles.
> - TextInputFormat is slow (HADOOP-2285). TextOutputFormat also looks suspect.
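
The notes above compare each compressed run against the uncompressed run of the same format. A minimal sketch of that arithmetic follows, using only the write figures from the table; the names WRITES, baselines, and relative are made up for illustration and are not part of the attached patch.

```python
# Write results copied from the "Writes" table:
# (format, codec, compression type, time in seconds, file size in bytes).
WRITES = [
    ("SEQ", "LZO", "BLOCK", 318, 8_604_288_397),
    ("SEQ", "LZO", "RECORD", 367, 11_689_969_413),
    ("SEQ", "ZIP", "BLOCK", 929, 2_827_697_769),
    ("SEQ", "ZIP", "RECORD", 1737, 9_324_730_365),
    ("SEQ", "", "", 201, 11_282_745_683),
    ("TXT", "LZO", "", 742, 12_671_065_769),
    ("TXT", "ZIP", "", 1320, 2_597_397_680),
    ("TXT", "", "", 392, 10_818_058_643),
]


def baselines(rows):
    """Map each format to the (seconds, bytes) of its uncompressed run."""
    return {fmt: (secs, size) for fmt, codec, _, secs, size in rows if not codec}


def relative(rows):
    """For each compressed run, return (size ratio, time ratio) measured
    against the uncompressed run of the same format."""
    base = baselines(rows)
    out = {}
    for fmt, codec, ctype, secs, size in rows:
        if not codec:
            continue  # the uncompressed rows are the baselines themselves
        base_secs, base_size = base[fmt]
        out[(fmt, codec, ctype)] = (size / base_size, secs / base_secs)
    return out


if __name__ == "__main__":
    for (fmt, codec, ctype), (size_ratio, time_ratio) in relative(WRITES).items():
        label = f"{fmt} {codec} {ctype}".strip()
        print(f"{label:16} size x{size_ratio:.2f}  write time x{time_ratio:.2f}")
```

A size ratio above 1 means the compressed file is larger than its uncompressed counterpart. That is the case for the TXT LZO row, as noted above, and by the same table it also holds for the SEQ LZO RECORD row (11 689 969 413 bytes against 11 282 745 683).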

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
