[ https://issues.apache.org/jira/browse/HADOOP-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Nauroth updated HADOOP-11188:
-----------------------------------
    Resolution: Fixed
    Fix Version/s: 2.7.0
    Status: Resolved  (was: Patch Available)

> hadoop-azure: automatically expand page blobs when they become full
> -------------------------------------------------------------------
>
>                 Key: HADOOP-11188
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11188
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>            Reporter: Eric Hanson
>            Assignee: Eric Hanson
>             Fix For: 2.7.0
>
>         Attachments: hadoop-11188.01.patch
>
>
> Right now, page blobs are initialized to a fixed size (fs.azure.page.blob.size) and cannot be expanded. This task is to make them expand automatically when they are nearly full.
> Design: if a write occurs that does not have enough room in the file to finish, then flush all preceding operations, extend the file, and complete the write. Access to PageBlobOutputStream will be synchronized (exclusive access), so there won't be race conditions.
> The file will be extended by fs.azure.page.blob.extension.size bytes, which must be a multiple of 512. The internal default for fs.azure.page.blob.extension.size will be 128 * 1024 * 1024. The minimum extension size will be 4 * 1024 * 1024, which is the maximum write size, so the new write is guaranteed to finish.
> Extension will stop when the file size reaches 1TB. The final extension may be less than fs.azure.page.blob.extension.size if the remainder (1TB - current_file_size) is smaller than fs.azure.page.blob.extension.size. (A rough sketch of this logic follows this message.)
> An alternative is to make the default size 1TB. That is much simpler to implement: a one-line change. Simpler still, don't change it at all, because the current behavior is adequate for HBase.
> Rationale for this file size extension feature:
> 1) Be able to download files to local disk easily with CloudXplorer and similar tools. Downloading a 1TB page blob is not practical if you don't have 1TB of disk space, since on the local side it expands to the full file size, filled with zeros where there is no valid data.
> 2) Don't make customers uncomfortable when they see large 1TB files. They often ask whether they have to pay for them, even though they only pay for the space actually used in the page blob.
> I think rationale 2 is a relatively minor issue, because 98% of customers for HBase will never notice. They will just use it and not look at what kind of files are used for the logs. They don't pay for the unused space, so it is not a problem for them. We can document this. Also, if they use hadoop fs -ls, they will see the actual size of the files, since I put in a fix for that.
> Rationale 1 is a minor issue because you cannot interpret the data on your local file system anyway due to the data format. So really, the only reason to copy data locally in its binary format would be to move it around or archive it. Copying a 1TB page blob from one location in the cloud to another is pretty fast with smart copy utilities that don't actually move the 0-filled parts of the file.
> Nevertheless, this is a convenience feature for users. They won't have to worry about setting fs.azure.page.blob.size under normal circumstances, and their files can grow as large as they need.
> If we extend the file size on the fly, that introduces new possible error and failure modes for HBase. We should include retry logic.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
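For illustration, here is a minimal, hedged Java sketch of the extension logic described in the issue, in the spirit of PageBlobOutputStream. The class, method, and field names (PageBlobExtensionSketch, computeNextBlobSize, resizeBlob, flushPendingOperations) are hypothetical and are not taken from hadoop-11188.01.patch; only the numbers (512-byte pages, the 4 * 1024 * 1024 minimum, the 128 * 1024 * 1024 default, and the 1TB cap) come from the description above.

{code:java}
import java.io.IOException;

/**
 * Hedged sketch of automatic page-blob extension (HADOOP-11188).
 * All names here are hypothetical; the committed patch may differ.
 */
public class PageBlobExtensionSketch {

  // Azure page blobs are addressed in 512-byte pages, so every size must be a multiple of 512.
  private static final long PAGE_SIZE = 512L;
  // Values from the issue description.
  private static final long MIN_EXTENSION_SIZE = 4L * 1024 * 1024;        // equals the maximum single write
  private static final long DEFAULT_EXTENSION_SIZE = 128L * 1024 * 1024;  // fs.azure.page.blob.extension.size default
  private static final long MAX_BLOB_SIZE = 1024L * 1024 * 1024 * 1024;   // extension stops at 1 TB

  private long currentBlobSize;                                // provisioned size of the page blob
  private long currentOffset;                                  // next write position in the stream
  private long configuredExtension = DEFAULT_EXTENSION_SIZE;   // from fs.azure.page.blob.extension.size

  /** Synchronized so concurrent callers cannot race on the extension. */
  public synchronized void write(byte[] buf, int off, int len) throws IOException {
    long writeEnd = currentOffset + len;
    if (writeEnd > currentBlobSize) {
      if (writeEnd > MAX_BLOB_SIZE) {
        throw new IOException("Write would grow the page blob past the 1 TB limit");
      }
      flushPendingOperations();   // flush all preceding operations before extending
      // One step is always enough: the minimum extension equals the maximum write size.
      currentBlobSize = computeNextBlobSize(currentBlobSize, configuredExtension);
      resizeBlob(currentBlobSize);
    }
    // ... issue the page write against the (now large enough) blob ...
    currentOffset = writeEnd;
  }

  /** Grow by the configured step, 512-aligned, never past 1 TB. */
  static long computeNextBlobSize(long currentSize, long configuredExtension) {
    long extension = roundUpToPage(Math.max(configuredExtension, MIN_EXTENSION_SIZE));
    // The final extension may be smaller than the configured step once the blob nears 1 TB.
    return Math.min(currentSize + extension, MAX_BLOB_SIZE);
  }

  static long roundUpToPage(long size) {
    long rem = size % PAGE_SIZE;
    return rem == 0 ? size : size + (PAGE_SIZE - rem);
  }

  // Placeholders for the real Azure storage calls (which would carry retry logic, per the description).
  private void flushPendingOperations() throws IOException { /* flush queued page writes */ }
  private void resizeBlob(long newSize) throws IOException { /* resize the underlying page blob */ }
}
{code}

Because the minimum extension equals the maximum single write, one extension step always leaves enough room to complete the pending write, and with the default settings the resize cost is paid only once per 128 MB of growth.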