[ https://issues.apache.org/jira/browse/HDFS-16147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17393310#comment-17393310 ]
Stephen O'Donnell edited comment on HDFS-16147 at 8/4/21, 4:44 PM:
-------------------------------------------------------------------

When the image is saved, there is a single stream written out serially. To enable parallel loading of the image, index entries are added for the sub-sections as they are written out. This means we have a single stream, with the positions of the sub-sections saved. When we load the image, there are two choices:

1. We start at the beginning of a section, open a stream and read the entire section.
2. We open several streams, reading each sub-section in parallel by jumping to the indexed sub-section position and reading the given length.

When you enable compression too, the entire stream is compressed from end to end as a single compressed stream. I wrongly thought there would be many compressed streams within the image file, and that is why I thought OIV etc would have trouble reading it. So it makes sense that OIV can read the image serially, and that the namenode can also read the image with parallel loading disabled when compression is on.

The surprise to me is that we can load the image in parallel, as that involves jumping into the middle of the compressed stream and starting to read, which most compression codecs do not support. It was my belief that gzip does not support this. However, looking at the existing code, before this change, I see that we already jump around in the stream:

{code}
    for (FileSummary.Section s : sections) {
      channel.position(s.getOffset());
      InputStream in = new BufferedInputStream(new LimitInputStream(fin,
          s.getLength()));

      in = FSImageUtil.wrapInputStreamForCompression(conf,
          summary.getCodec(), in);
{code}

So that must mean the compression codecs are splittable somehow, and they can start decompressing from a random position in the stream. Due to this, if the image is compressed, the existing parallel code can be mostly reused to load the sub-sections within the compressed stream.

From the above, could we allow parallel loading of compressed images by simply removing the code which disallows it?

{code}
-    if (loadInParallel) {
-      if (compressionEnabled) {
-        LOG.warn("Parallel Image loading and saving is not supported when {}" +
-            " is set to true. Parallel will be disabled.",
-            DFSConfigKeys.DFS_IMAGE_COMPRESS_KEY);
-        loadInParallel = false;
-      }
-    }
{code}

Then let the image save compressed with the sub-sections indexed and try to load it?
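To make the idea concrete, here is a rough sketch (not the actual patch) of how a parallel worker could open one indexed sub-section, reusing the same seek-and-wrap pattern as the per-section snippet above. The method name openSubSectionStream is just for illustration, and it assumes the codec really can begin decompressing at the indexed offset, which is the open question above.

{code}
  // Illustrative only: open a bounded, decompressing stream over one indexed
  // sub-section, the same way the serial loader does for a full section.
  // Uses the same classes as the snippet above (FileSummary, LimitInputStream,
  // FSImageUtil); one FileInputStream/FileChannel per worker thread.
  private InputStream openSubSectionStream(File imageFile, FileSummary summary,
      FileSummary.Section subSection, Configuration conf) throws IOException {
    FileInputStream fin = new FileInputStream(imageFile);
    FileChannel channel = fin.getChannel();
    channel.position(subSection.getOffset());               // jump to the indexed position
    InputStream in = new BufferedInputStream(
        new LimitInputStream(fin, subSection.getLength())); // bound the read to this sub-section
    // Wrap for decompression exactly as the existing per-section code does.
    return FSImageUtil.wrapInputStreamForCompression(conf, summary.getCodec(), in);
  }
{code}

If each worker did this for its own sub-section, the only new requirement over the serial path would seem to be that decompression can start at a sub-section boundary rather than a section boundary.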
> load fsimage with parallelization and compression
> -------------------------------------------------
>
> Key: HDFS-16147
> URL: https://issues.apache.org/jira/browse/HDFS-16147
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Affects Versions: 3.3.0
> Reporter: liuyongpan
> Priority: Minor
> Attachments: HDFS-16147.001.patch, HDFS-16147.002.patch, subsection.svg