[
https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657567#action_12657567
]
Hong Tang commented on HADOOP-3315:
-----------------------------------
Some preliminary random-seek performance numbers for TFile.
Settings:
- Key size = 10-50B (Zipf distribution, sigma = 1.2)
- Value size = 1-2KB (uniform distribution)
- TFile size = 10GB
- Compression: none and lzo (under lzo, the compression ratio is 1:2.1).
- Block size = 128KB, 256KB, 512KB, 1MB.
- File system: (1) a local file system on a single disk, >80% full; (2) a
single-node DFS with 4 disks (JBOD), >70% full; (3) a single-node DFS with 4
disks (RAID0), almost empty.
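For reference, the stated key/value size distributions could be generated with a sketch like the following (the class name and the exact sampling scheme, inverse-CDF over a truncated Zipf, are my assumptions, not taken from the benchmark code):

```java
import java.util.Random;

// Sketch: draw key/value sizes matching the stated workload.
// Assumption: key size s in [10, 50] bytes with probability
// proportional to 1/s^1.2; value size uniform in [1 KB, 2 KB].
public class WorkloadSizes {
    private static final int KEY_MIN = 10, KEY_MAX = 50;
    private static final double SIGMA = 1.2;
    private static final double[] CDF = buildCdf();

    private static double[] buildCdf() {
        double[] c = new double[KEY_MAX - KEY_MIN + 1];
        double sum = 0;
        for (int s = KEY_MIN; s <= KEY_MAX; s++) {
            sum += 1.0 / Math.pow(s, SIGMA);
            c[s - KEY_MIN] = sum;
        }
        for (int i = 0; i < c.length; i++) c[i] /= sum; // normalize to a CDF
        return c;
    }

    // Inverse-CDF sampling of a Zipf-distributed key size.
    public static int keySize(Random rnd) {
        double u = rnd.nextDouble();
        for (int i = 0; i < CDF.length; i++) {
            if (u <= CDF[i]) return KEY_MIN + i;
        }
        return KEY_MAX;
    }

    // Uniform value size in [1024, 2048] bytes.
    public static int valueSize(Random rnd) {
        return 1024 + rnd.nextInt(1025);
    }
}
```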
Results:
1. Local File System, single disk, >80% full.
|| BlkSize || none (ms) || lzo (ms) ||
| 128K | 20.10 | 27.50 |
| 256K | 32.18 | 29.16 |
| 512K | 32.50 | 44.60 |
| 1M | 33.75 | 52.20 |
2. A single-node DFS with 4 disks (JBOD), >70% full.
|| BlkSize || none (ms) || lzo (ms) ||
| 128K | 32.96 | 31.72 |
| 256K | 33.94 | 33.75 |
| 512K | 43.66 | 40.42 |
| 1M | 54.30 | 51.05 |
3. A single-node DFS with 4 disks (RAID0), almost empty.
|| BlkSize || none (ms) || lzo (ms) ||
| 128K | 19.70 | 20.74 |
| 256K | 20.55 | 21.26 |
| 512K | 22.21 | 24.74 |
| 1M | 24.51 | 27.08 |
Some observations:
- The larger the block size, the longer the seek time, because the search
inside a block is a linear scan.
- lzo yields longer seek times, because it pays the overhead of decompression
and needs to examine more uncompressed bytes.
- The performance of lzo and none is comparable on DFS, but on the local file
system lzo is worse than none. This is because on the local file system, I/O
and decompression are done sequentially, whereas on DFS the actual I/O is
overlapped with computation and I/O is the critical path.
- RAID0 seems to offer significantly better random I/O throughput. However,
this is not conclusive: other factors may also play a role, e.g. an empty
file system, a faster CPU, or network connectivity.
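The first observation can be captured with a back-of-envelope cost model (my sketch, not the benchmark code): once the block index narrows a seek to one block, on average half the block's records must be scanned, so expected in-block work grows linearly with block size.

```java
// Sketch: expected records scanned per random seek, assuming the block
// index locates the target block and records inside a block are then
// scanned linearly from the block start (average: half a block).
public class SeekCost {
    // avgRecordBytes = average key+value size (roughly 1.5 KB in this
    // benchmark's workload).
    public static double expectedRecordsScanned(int blockSizeBytes,
                                                int avgRecordBytes) {
        double recordsPerBlock = (double) blockSizeBytes / avgRecordBytes;
        return recordsPerBlock / 2.0; // on average, scan half the block
    }
}
```

With ~1.5 KB records, a 128 KB block means scanning ~43 records per seek on average, while a 1 MB block means ~341, which is consistent with the seek times above growing with block size.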
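The overlap described in the third observation can be sketched as a two-stage pipeline: a reader thread fetches blocks while the main thread decompresses them, so the total time approaches the I/O time alone rather than the sum of I/O and CPU time. The costs and pipeline depth below are simulated placeholders, not TFile internals.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: overlapping simulated "I/O" with simulated "decompression"
// via a producer/consumer pipeline, as happens on DFS. Sequential
// execution would cost blocks * (ioMs + cpuMs); the pipeline hides
// most of cpuMs behind the next block's I/O.
public class OverlapPipeline {
    public static long pipelinedMs(int blocks, long ioMs, long cpuMs)
            throws InterruptedException {
        BlockingQueue<Integer> q = new ArrayBlockingQueue<>(4);
        long start = System.nanoTime();
        Thread reader = new Thread(() -> {      // producer: simulated I/O
            try {
                for (int i = 0; i < blocks; i++) {
                    Thread.sleep(ioMs);         // fetch one block
                    q.put(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.start();
        for (int i = 0; i < blocks; i++) {      // consumer: decompression
            q.take();
            Thread.sleep(cpuMs);                // decompress one block
        }
        reader.join();
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

On a local file system there is no such producer thread: each block is read, then decompressed, strictly in sequence, which is why lzo's decompression cost shows up directly in the local-disk numbers.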
> New binary file format
> ----------------------
>
> Key: HADOOP-3315
> URL: https://issues.apache.org/jira/browse/HADOOP-3315
> Project: Hadoop Core
> Issue Type: New Feature
> Components: io
> Reporter: Owen O'Malley
> Assignee: Amir Youssefi
> Attachments: HADOOP-3315_20080908_TFILE_PREVIEW_WITH_LZO_TESTS.patch,
> HADOOP-3315_20080915_TFILE.patch, hadoop-trunk-tfile.patch, TFile
> Specification 20081217.pdf
>
>
> SequenceFile's block compression format is too complex and requires 4 codecs
> to compress or decompress. It would be good to have a file format that only
> needs
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.