Micro-benchmark to measure read/write times through InputFormats
----------------------------------------------------------------
Key: HADOOP-2406
URL: https://issues.apache.org/jira/browse/HADOOP-2406
Project: Hadoop
Issue Type: Test
Components: fs, test
Reporter: Chris Douglas
Assignee: Chris Douglas
Fix For: 0.16.0
The attached test writes X GB to the default filesystem and reads it back
through SequenceFileInputFormat and TextInputFormat, using LzoCodec, GzipCodec,
and no compression. SequenceFiles are tested with both block and record
compression.
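
As a rough illustration of the measurement approach (not the attached test
itself), the sketch below times a write and a read of tab-separated text
records with and without gzip. It uses Python's standard library instead of the
Hadoop InputFormats and codecs; the word list and the names make_records, bench
and run are hypothetical, chosen only to mirror the 5-word keys, 20-word values
and fixed seed described in this issue.

```python
import gzip
import os
import random
import tempfile
import time

# Hypothetical vocabulary; the real test draws its text from RandomTextWriter.
WORDS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel"]


def make_records(n, seed=0):
    """Build n (key, value) pairs: 5-word keys, 20-word values, fixed seed."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        key = " ".join(rng.choice(WORDS) for _ in range(5))
        value = " ".join(rng.choice(WORDS) for _ in range(20))
        records.append((key, value))
    return records


def bench(records, opener, path):
    """Time one write pass and one read pass; return timings, size, line count."""
    start = time.perf_counter()
    with opener(path, "wt", encoding="utf-8") as out:
        for key, value in records:
            out.write(f"{key}\t{value}\n")
    write_sec = time.perf_counter() - start

    start = time.perf_counter()
    with opener(path, "rt", encoding="utf-8") as src:
        lines_read = sum(1 for _ in src)
    read_sec = time.perf_counter() - start

    return write_sec, read_sec, os.path.getsize(path), lines_read


def run(n=2000):
    """Benchmark the same records through plain and gzip-compressed text files."""
    records = make_records(n)
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name, opener in (("none", open), ("gzip", gzip.open)):
            path = os.path.join(tmp, f"data.{name}")
            results[name] = bench(records, opener, path)
    return results


if __name__ == "__main__":
    for name, (w, r, size, count) in run().items():
        print(f"{name}: write {w:.4f}s, read {r:.4f}s, {size} bytes, {count} lines")
```

Using the same seed for every file keeps the input identical across codecs, so
differences in time and size come from the format and compression rather than
from the data.
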
The following results, using 10GB of data through RawLocalFileSystem with
5-word keys and 20-word values (generated by RandomTextWriter with the same
seed for each file), are fairly stable:
Writes:
|| Format || Compression || Type || Time (sec) || Filesize (bytes) ||
| SEQ | LZO | BLOCK | 318 | 8 604 288 397 |
| SEQ | LZO | RECORD | 367 | 11 689 969 413 |
| SEQ | ZIP | BLOCK | 929 | 2 827 697 769 |
| SEQ | ZIP | RECORD | 1737 | 9 324 730 365 |
| SEQ | | | 201 | 11 282 745 683 |
| TXT | LZO | | 742 | 12 671 065 769 |
| TXT | ZIP | | 1320 | 2 597 397 680 |
| TXT | | | 392 | 10 818 058 643 |
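
To make the write numbers easier to compare, the following sketch derives two
ratios from the table above: file size relative to the uncompressed file of the
same format, and write time relative to the uncompressed write. The data is
copied from the table; the names WRITES, size_ratio and write_slowdown are only
illustrative.

```python
# (format, codec, type) -> (write time in seconds, file size in bytes),
# copied from the Writes table; "" marks an empty table cell.
WRITES = {
    ("SEQ", "LZO", "BLOCK"): (318, 8_604_288_397),
    ("SEQ", "LZO", "RECORD"): (367, 11_689_969_413),
    ("SEQ", "ZIP", "BLOCK"): (929, 2_827_697_769),
    ("SEQ", "ZIP", "RECORD"): (1737, 9_324_730_365),
    ("SEQ", "", ""): (201, 11_282_745_683),
    ("TXT", "LZO", ""): (742, 12_671_065_769),
    ("TXT", "ZIP", ""): (1320, 2_597_397_680),
    ("TXT", "", ""): (392, 10_818_058_643),
}


def size_ratio(fmt, codec, ctype=""):
    """File size divided by the uncompressed file size of the same format."""
    baseline_size = WRITES[(fmt, "", "")][1]
    return WRITES[(fmt, codec, ctype)][1] / baseline_size


def write_slowdown(fmt, codec, ctype=""):
    """Write time divided by the uncompressed write time of the same format."""
    baseline_sec = WRITES[(fmt, "", "")][0]
    return WRITES[(fmt, codec, ctype)][0] / baseline_sec


for fmt, codec, ctype in WRITES:
    if codec:  # skip the uncompressed baselines themselves
        print(f"{fmt} {codec} {ctype or '-':6} "
              f"size x{size_ratio(fmt, codec, ctype):.2f} "
              f"time x{write_slowdown(fmt, codec, ctype):.2f}")
```

A size ratio above 1 means the compressed file came out larger than the
uncompressed one, which is the case for the TXT/LZO row.
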
Reads:
|| Format || Compression || Type || Time (sec) ||
| SEQ | LZO | BLOCK | 150 |
| SEQ | LZO | RECORD | 281 |
| SEQ | ZIP | BLOCK | 155 |
| SEQ | ZIP | RECORD | 548 |
| SEQ | | | 209 |
| TXT | LZO | | 620 |
| TXT | ZIP | | 355 |
| TXT | | | 284 |
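
The read and write times can be set side by side in the same way. This sketch
copies both time columns from the tables and computes how many times longer
each configuration takes to write than to read; the names WRITE_SEC, READ_SEC
and write_to_read_ratio are only illustrative.

```python
# (format, codec, type) -> time in seconds, copied from the Writes and Reads
# tables; "" marks an empty table cell.
WRITE_SEC = {
    ("SEQ", "LZO", "BLOCK"): 318,
    ("SEQ", "LZO", "RECORD"): 367,
    ("SEQ", "ZIP", "BLOCK"): 929,
    ("SEQ", "ZIP", "RECORD"): 1737,
    ("SEQ", "", ""): 201,
    ("TXT", "LZO", ""): 742,
    ("TXT", "ZIP", ""): 1320,
    ("TXT", "", ""): 392,
}
READ_SEC = {
    ("SEQ", "LZO", "BLOCK"): 150,
    ("SEQ", "LZO", "RECORD"): 281,
    ("SEQ", "ZIP", "BLOCK"): 155,
    ("SEQ", "ZIP", "RECORD"): 548,
    ("SEQ", "", ""): 209,
    ("TXT", "LZO", ""): 620,
    ("TXT", "ZIP", ""): 355,
    ("TXT", "", ""): 284,
}


def write_to_read_ratio(key):
    """Write time divided by read time for one (format, codec, type) row."""
    return WRITE_SEC[key] / READ_SEC[key]


slowest_read = max(READ_SEC, key=READ_SEC.get)
fastest_read = min(READ_SEC, key=READ_SEC.get)

for key in READ_SEC:
    print(key, f"write/read = {write_to_read_ratio(key):.2f}")
print("slowest read:", slowest_read, "fastest read:", fastest_read)
```

In these numbers, the uncompressed SequenceFile is the only configuration that
reads slightly slower than it writes (209 s vs. 201 s).
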
Of note:
- LZO-compressed text output is larger than the uncompressed output
(12 671 065 769 vs. 10 818 058 643 bytes; HADOOP-2402), and lzop cannot read it.
- Gzip compression (ZIP in the tables) is expensive: writes take roughly 3.4 to
8.6 times as long as the uncompressed writes of the same format. The short
values explain the unimpressive compression of record-compressed SequenceFiles.
- TextInputFormat is slow (HADOOP-2285). TextOutputFormat also looks suspect.