[jira] Created: (HADOOP-2406) Micro-benchmark to measure read/write times through InputFormats

Chris Douglas (JIRA) Tue, 11 Dec 2007 15:44:06 -0800

Micro-benchmark to measure read/write times through InputFormats
----------------------------------------------------------------


                 Key: HADOOP-2406
                 URL: https://issues.apache.org/jira/browse/HADOOP-2406
             Project: Hadoop
          Issue Type: Test
          Components: fs, test
            Reporter: Chris Douglas
            Assignee: Chris Douglas
             Fix For: 0.16.0


The attached test writes X GB to the default filesystem and reads it back 
through SequenceFileInputFormat and TextInputFormat, using LzoCodec, GzipCodec, 
and no compression. SequenceFiles are tested with both block and record 
compression.
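
For readers who want the shape of the measurement without a Hadoop install, the 
following is a minimal sketch of the same idea: time a write and a read of 
identical data with and without a codec, and record the resulting file size. It 
is not the attached test. It uses Python's standard gzip module in place of 
GzipCodec, and the names time_write_read and run_benchmark and the sample 
payload are assumptions made for illustration.

```python
import gzip
import os
import tempfile
import time


def time_write_read(path, opener, payload, repeats):
    """Write `payload` `repeats` times through `opener`, then read it all back.

    Returns wall-clock seconds for each phase, the on-disk file size, and the
    number of logical (decompressed) bytes read.
    """
    start = time.perf_counter()
    with opener(path, "wb") as out:
        for _ in range(repeats):
            out.write(payload)
    write_sec = time.perf_counter() - start

    start = time.perf_counter()
    logical_bytes = 0
    with opener(path, "rb") as src:
        while True:
            chunk = src.read(1 << 16)
            if not chunk:
                break
            logical_bytes += len(chunk)
    read_sec = time.perf_counter() - start

    return {
        "write_sec": write_sec,
        "read_sec": read_sec,
        "file_bytes": os.path.getsize(path),
        "logical_bytes": logical_bytes,
    }


def run_benchmark(payload=b"k1 k2 k3 k4 k5\tv1 v2 v3 v4 v5\n", repeats=10000):
    """Run the write/read timing once uncompressed and once through gzip."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-ins: plain open() for "no compression",
        # gzip.open() for a gzip-style codec.
        for name, opener in (("none", open), ("gzip", gzip.open)):
            path = os.path.join(tmp, name + ".dat")
            results[name] = time_write_read(path, opener, payload, repeats)
    return results


if __name__ == "__main__":
    for name, stats in run_benchmark().items():
        print(name, stats)
```

The sketch writes the same bytes for every configuration, so differences in 
time and file size come from the codec rather than from the data.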

The following results were obtained by passing 10GB of data through 
RawLocalFileSystem with 5-word keys and 20-word values (as generated by 
RandomTextWriter with the same seed for each file). They are pretty stable:

Writes:
|| Format || Compression || Type || Time (sec) || Filesize (bytes) ||
| SEQ | LZO | BLOCK | 318 | 8 604 288 397 |
| SEQ | LZO | RECORD | 367 | 11 689 969 413 |
| SEQ | ZIP | BLOCK | 929 | 2 827 697 769 |
| SEQ | ZIP | RECORD | 1737 | 9 324 730 365 |
| SEQ |  |  | 201 | 11 282 745 683 |
| TXT | LZO |  | 742 | 12 671 065 769 |
| TXT | ZIP |  | 1320 | 2 597 397 680 |
| TXT |  |  | 392 | 10 818 058 643 |
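
The file sizes are easier to compare as ratios against the uncompressed file of 
the same format. The sketch below does that arithmetic on the figures from the 
Writes table; the names WRITE_BYTES and size_ratios are illustrative and not 
part of the attached test.

```python
# File sizes in bytes, copied from the Writes table.
# Keys are (format, compression, type); empty strings mean "none".
WRITE_BYTES = {
    ("SEQ", "LZO", "BLOCK"): 8_604_288_397,
    ("SEQ", "LZO", "RECORD"): 11_689_969_413,
    ("SEQ", "ZIP", "BLOCK"): 2_827_697_769,
    ("SEQ", "ZIP", "RECORD"): 9_324_730_365,
    ("SEQ", "", ""): 11_282_745_683,
    ("TXT", "LZO", ""): 12_671_065_769,
    ("TXT", "ZIP", ""): 2_597_397_680,
    ("TXT", "", ""): 10_818_058_643,
}


def size_ratios(sizes):
    """Return each file size divided by the uncompressed size of the same format."""
    ratios = {}
    for (fmt, codec, ctype), size in sizes.items():
        baseline = sizes[(fmt, "", "")]
        ratios[(fmt, codec, ctype)] = size / baseline
    return ratios


if __name__ == "__main__":
    for key, ratio in size_ratios(WRITE_BYTES).items():
        print(key, round(ratio, 3))
```

A ratio above 1.0 means the "compressed" file is larger than its uncompressed 
counterpart, which is the case here for LZO text and for LZO record-compressed 
SequenceFiles.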

Reads:
|| Format || Compression || Type || Time (sec) ||
| SEQ | LZO | BLOCK | 150 |
| SEQ | LZO | RECORD | 281 |
| SEQ | ZIP | BLOCK | 155 |
| SEQ | ZIP | RECORD | 548 |
| SEQ |  |  | 209 |
| TXT | LZO |  | 620 |
| TXT | ZIP |  | 355 |
| TXT |  |  | 284 |
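
The read times can be normalized the same way, against the uncompressed read of 
the same format. This is a small sketch over the figures in the Reads table; 
READ_SECONDS and relative_read_time are illustrative names, not part of the 
attached test.

```python
# Read times in seconds, copied from the Reads table.
# Keys are (format, compression, type); empty strings mean "none".
READ_SECONDS = {
    ("SEQ", "LZO", "BLOCK"): 150,
    ("SEQ", "LZO", "RECORD"): 281,
    ("SEQ", "ZIP", "BLOCK"): 155,
    ("SEQ", "ZIP", "RECORD"): 548,
    ("SEQ", "", ""): 209,
    ("TXT", "LZO", ""): 620,
    ("TXT", "ZIP", ""): 355,
    ("TXT", "", ""): 284,
}


def relative_read_time(times):
    """Return each read time divided by the uncompressed time of the same format."""
    relative = {}
    for (fmt, codec, ctype), seconds in times.items():
        baseline = times[(fmt, "", "")]
        relative[(fmt, codec, ctype)] = seconds / baseline
    return relative


if __name__ == "__main__":
    for key, factor in relative_read_time(READ_SECONDS).items():
        print(key, round(factor, 2))
```

With these numbers, only the block-compressed SequenceFiles read faster than 
their uncompressed baseline; every other compressed configuration is slower.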


Of note:
- Lzo-compressed TextOutput is larger than the uncompressed output 
(12 671 065 769 vs. 10 818 058 643 bytes; see HADOOP-2402), and lzop cannot 
read it.
- Zip compression is expensive: block-compressed SequenceFile writes take 929 
sec against 201 sec uncompressed. Short values are responsible for the 
unimpressive compression of record-compressed SequenceFiles.
- TextInputFormat is slow (HADOOP-2285). TextOutputFormat also looks suspect.

