[ https://issues.apache.org/jira/browse/HADOOP-3315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12657567#action_12657567 ]

Hong Tang commented on HADOOP-3315:
-----------------------------------

Some preliminary random-seek performance numbers for TFile.

Settings:
- Key size = 10-50B (Zipf distribution, sigma = 1.2); see the workload sketch after this list.
- Value size = 1-2KB (uniform distribution)
- TFile size = 10GB
- Compression: none and lzo (under lzo, the compression ratio is 1:2.1).
- Block size = 128KB, 256KB, 512KB, 1MB.
- File system: (1) a single-disk local file system, >80% full; (2) a 
single-node DFS with 4 disks (JBOD), >70% full; (3) a single-node DFS with 4 
disks (RAID0), almost empty.
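
For reference, a minimal sketch of how such a workload could be generated. The class and method names below are hypothetical and not part of the patch; the code only illustrates drawing key sizes from a bounded Zipf-like distribution (sigma = 1.2) over 10-50 bytes and value sizes uniformly from 1-2KB.

{code:java}
import java.util.Random;

/**
 * Hypothetical workload generator for the seek benchmark described above:
 * key sizes follow a bounded Zipf-like distribution (sigma = 1.2) over
 * 10-50 bytes, value sizes are drawn uniformly from 1-2KB.
 */
public class SeekWorkload {
  private static final int MIN_KEY = 10, MAX_KEY = 50;
  private static final int MIN_VAL = 1024, MAX_VAL = 2048;
  private static final double SIGMA = 1.2;

  private final Random rng = new Random();
  private final double[] cdf; // cumulative Zipf weights over key sizes

  public SeekWorkload() {
    int n = MAX_KEY - MIN_KEY + 1;
    cdf = new double[n];
    double sum = 0;
    for (int i = 0; i < n; i++) {
      sum += 1.0 / Math.pow(i + 1, SIGMA); // Zipf weight for rank i+1
      cdf[i] = sum;
    }
    for (int i = 0; i < n; i++) {
      cdf[i] /= sum; // normalize into a CDF
    }
  }

  /** Draw a key size: Zipf rank r maps to size MIN_KEY + r. */
  public int nextKeySize() {
    double u = rng.nextDouble();
    for (int i = 0; i < cdf.length; i++) {
      if (u <= cdf[i]) {
        return MIN_KEY + i;
      }
    }
    return MAX_KEY;
  }

  /** Draw a value size uniformly from [MIN_VAL, MAX_VAL]. */
  public int nextValueSize() {
    return MIN_VAL + rng.nextInt(MAX_VAL - MIN_VAL + 1);
  }

  /** Fill a byte array of the given size with random content. */
  public byte[] randomBytes(int size) {
    byte[] b = new byte[size];
    rng.nextBytes(b);
    return b;
  }
}
{code}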

Results:
1. Local File System, single disk, >80% full.
|| BlkSize || none (ms) || lzo (ms) ||
| 128K | 20.10 | 27.50 |
| 256K | 32.18 | 29.16 |
| 512K | 32.50 | 44.60 |
| 1M | 33.75 |  52.20 |

2. A single-node DFS with 4 disks (JBO), >70% full.
|| BlkSize || none (ms) || lzo (ms) ||
| 128K | 32.96 | 31.72 |
| 256K | 33.94 | 33.75 |
| 512K | 43.66 | 40.42 |
| 1M | 54.30 | 51.05 |

3. A single-node DFS with 4 disks (RAID0), almost empty.
|| BlkSize || none (ms) || lzo (ms) ||
| 128K | 19.70 | 20.74 |
| 256K | 20.55 | 21.26 |
| 512K | 22.21 | 24.74 |
| 1M | 24.51 | 27.08 |

Some observations:
- The larger the block size, the longer the seek time, because the search inside a 
block is a linear scan (see the sketch after this list).
- lzo yields longer seek times, because it pays the overhead of decompression 
and has to examine more uncompressed bytes.
- The performance of lzo and none is comparable on DFS, but on the local file 
system lzo is worse than none. On the local file system, I/O and decompression 
are done sequentially, while on DFS the actual I/O is overlapped with 
computation and I/O is the critical path.
- RAID0 seems to offer significantly better random I/O throughput, but this is 
not conclusive: other factors may also play a role, e.g. an almost empty file 
system, a faster CPU, or network connectivity.
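
To illustrate the first observation, here is a simplified, hypothetical model of a seek in a block-indexed file. It is not the TFile implementation, just the general pattern: a binary search over per-block first keys locates the block, then entries inside the block are scanned linearly, so the in-block cost grows with block size.

{code:java}
import java.util.List;

/**
 * Simplified, hypothetical model of a seek in a block-indexed file
 * (not the actual TFile code): binary search over the first key of each
 * block, then a linear scan inside the selected block. With larger
 * blocks the linear part dominates, which matches the numbers above.
 */
public class BlockSeekSketch {
  /** In-memory stand-in for one data block: sorted keys and their values. */
  public static class Block {
    final List<byte[]> keys;   // sorted within the block
    final List<byte[]> values;
    Block(List<byte[]> keys, List<byte[]> values) {
      this.keys = keys;
      this.values = values;
    }
  }

  private final List<Block> blocks; // blocks in key order, each non-empty

  public BlockSeekSketch(List<Block> blocks) {
    this.blocks = blocks;
  }

  /** Returns the value for 'key', or null if it is not present. */
  public byte[] seek(byte[] key) {
    if (blocks.isEmpty()) {
      return null;
    }
    // Phase 1: binary search over block first-keys (cheap, in-memory index)
    // to find the last block whose first key is <= key.
    int lo = 0, hi = blocks.size() - 1, target = 0;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      if (compare(blocks.get(mid).keys.get(0), key) <= 0) {
        target = mid;
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    // Phase 2: linear scan inside the block; cost grows with block size,
    // and with compression every scanned entry must first be decompressed.
    Block b = blocks.get(target);
    for (int i = 0; i < b.keys.size(); i++) {
      if (compare(b.keys.get(i), key) == 0) {
        return b.values.get(i);
      }
    }
    return null;
  }

  /** Lexicographic comparison of unsigned byte strings. */
  private static int compare(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int d = (a[i] & 0xff) - (b[i] & 0xff);
      if (d != 0) return d;
    }
    return a.length - b.length;
  }
}
{code}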

> New binary file format
> ----------------------
>
>                 Key: HADOOP-3315
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3315
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: io
>            Reporter: Owen O'Malley
>            Assignee: Amir Youssefi
>         Attachments: HADOOP-3315_20080908_TFILE_PREVIEW_WITH_LZO_TESTS.patch, 
> HADOOP-3315_20080915_TFILE.patch, hadoop-trunk-tfile.patch, TFile 
> Specification 20081217.pdf
>
>
> SequenceFile's block compression format is too complex and requires 4 codecs 
> to compress or decompress. It would be good to have a file format that only 
> needs 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
