[jira] [Created] (HBASE-28100) The size that is checked against the maxfilesize threshold is the uncompressed size of the HFile
alan.zhao created HBASE-28100:
----------------------------------

             Summary: The size that is checked against the maxfilesize threshold is the uncompressed size of the HFile
                 Key: HBASE-28100
                 URL: https://issues.apache.org/jira/browse/HBASE-28100
             Project: HBase
          Issue Type: Bug
         Environment: HBase 2.x
            Reporter: alan.zhao
            Assignee: alan.zhao
         Attachments: image-2023-09-20-12-09-49-959.png

The HBase server is configured to use Snappy compression. When doing a bulkload in HBase, the size that is checked against the maxfilesize threshold is the uncompressed size of the HFile, not the compressed size.

HFileOutputFormat2.class:
{code:java}
// code placeholder
new RecordWriter() {
  @Override
  public void write(ImmutableBytesWritable row, V cell) throws IOException {
    ...
  }
}
{code}
!image-2023-09-20-12-09-49-959.png!

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
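The effect reported above can be illustrated without HBase. The sketch below is a minimal, self-contained model (all names are hypothetical, and `java.util.zip.Deflater` stands in for Snappy): a threshold check against the writer's uncompressed byte count fires even when the on-disk, compressed file is far below the threshold.

```java
import java.util.zip.Deflater;

public class SizeCheckSketch {
    public static void main(String[] args) {
        // Highly repetitive data compresses well, like many real HFile payloads.
        byte[] uncompressed = new byte[1 << 20];   // 1 MiB of zeros
        byte[] buf = new byte[uncompressed.length];

        Deflater deflater = new Deflater();
        deflater.setInput(uncompressed);
        deflater.finish();
        int compressedLen = 0;
        while (!deflater.finished()) {
            compressedLen += deflater.deflate(buf);
        }
        deflater.end();

        long maxFileSize = 512 * 1024;             // hypothetical maxfilesize threshold

        // Check against the uncompressed size (the behaviour reported in the issue):
        boolean rollsOnUncompressed = uncompressed.length > maxFileSize;
        // Check against the on-disk (compressed) size:
        boolean rollsOnCompressed = compressedLen > maxFileSize;

        System.out.println("uncompressed=" + uncompressed.length
            + " compressed=" + compressedLen);
        System.out.println("rollsOnUncompressed=" + rollsOnUncompressed
            + " rollsOnCompressed=" + rollsOnCompressed);
    }
}
```

With a good compression ratio the two checks disagree: the uncompressed-size check splits the output even though the compressed file comfortably fits under the threshold.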
[jira] [Created] (HBASE-27733) hfile split occurs during bulkload, the new HFile file does not specify favored nodes
alan.zhao created HBASE-27733:
----------------------------------

             Summary: hfile split occurs during bulkload, the new HFile file does not specify favored nodes
                 Key: HBASE-27733
                 URL: https://issues.apache.org/jira/browse/HBASE-27733
             Project: HBase
          Issue Type: Improvement
            Reporter: alan.zhao
            Assignee: alan.zhao

BulkLoadHFilesTool.class:
{code:java}
/**
 * Copy half of an HFile into a new HFile.
 */
private static void copyHFileHalf(Configuration conf, Path inFile, Path outFile,
  Reference reference, ColumnFamilyDescriptor familyDescriptor) throws IOException {
  FileSystem fs = inFile.getFileSystem(conf);
  CacheConfig cacheConf = CacheConfig.DISABLED;
  HalfStoreFileReader halfReader = null;
  StoreFileWriter halfWriter = null;
  try {
    ReaderContext context = new ReaderContextBuilder().withFileSystemAndPath(fs, inFile).build();
    StoreFileInfo storeFileInfo =
      new StoreFileInfo(conf, fs, fs.getFileStatus(inFile), reference);
    storeFileInfo.initHFileInfo(context);
    halfReader = (HalfStoreFileReader) storeFileInfo.createReader(context, cacheConf);
    storeFileInfo.getHFileInfo().initMetaAndIndex(halfReader.getHFileReader());
    Map fileInfo = halfReader.loadFileInfo();

    int blocksize = familyDescriptor.getBlocksize();
    Algorithm compression = familyDescriptor.getCompressionType();
    BloomType bloomFilterType = familyDescriptor.getBloomFilterType();
    HFileContext hFileContext = new HFileContextBuilder().withCompression(compression)
      .withChecksumType(StoreUtils.getChecksumType(conf))
      .withBytesPerCheckSum(StoreUtils.getBytesPerChecksum(conf)).withBlockSize(blocksize)
      .withDataBlockEncoding(familyDescriptor.getDataBlockEncoding()).withIncludesTags(true)
      .withCreateTime(EnvironmentEdgeManager.currentTime()).build();

    // The new writer is built without favored nodes:
    halfWriter = new StoreFileWriter.Builder(conf, cacheConf, fs).withFilePath(outFile)
      .withBloomType(bloomFilterType).withFileContext(hFileContext).build();

    HFileScanner scanner = halfReader.getScanner(false, false, false);
    scanner.seekTo();
    ...
{code}
When an HFile split occurs during bulkload, the new HFile does not specify favored nodes, which hurts data locality. Internally, we implemented a version of copyHFileHalf() that specifies the favored nodes for the split HFile to avoid compromising locality.
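The gap described above can be modelled in a few lines. This is a toy sketch, not HBase code: `WriterBuilder` is a hypothetical stand-in for StoreFileWriter.Builder, showing that unless the favored nodes are threaded through to the half-file writer, block placement falls back to arbitrary datanodes.

```java
import java.net.InetSocketAddress;
import java.util.Arrays;

public class FavoredNodesSketch {
    // Hypothetical stand-in for a store-file writer builder.
    static class WriterBuilder {
        InetSocketAddress[] favoredNodes;   // null => HDFS picks datanodes arbitrarily

        WriterBuilder withFavoredNodes(InetSocketAddress[] nodes) {
            this.favoredNodes = nodes;
            return this;
        }

        String build() {
            return favoredNodes == null
                ? "placement: arbitrary"
                : "placement: " + Arrays.toString(favoredNodes);
        }
    }

    public static void main(String[] args) {
        InetSocketAddress[] regionNodes =
            { InetSocketAddress.createUnresolved("dn1", 9866) };

        // Current copyHFileHalf(): favored nodes never reach the builder.
        String current = new WriterBuilder().build();
        // Proposed change: pass the region's favored nodes through to the new writer.
        String proposed = new WriterBuilder().withFavoredNodes(regionNodes).build();

        System.out.println(current);
        System.out.println(proposed);
    }
}
```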
[jira] [Created] (HBASE-27688) HFile splitting occurs during bulkload, the CREATE_TIME_TS of hfileinfo is 0
alan.zhao created HBASE-27688:
----------------------------------

             Summary: HFile splitting occurs during bulkload, the CREATE_TIME_TS of hfileinfo is 0
                 Key: HBASE-27688
                 URL: https://issues.apache.org/jira/browse/HBASE-27688
             Project: HBase
          Issue Type: Bug
            Reporter: alan.zhao

If HFile splitting occurs during bulkload, the CREATE_TIME_TS of HFileInfo is 0: when the HFile is copied after splitting, the CREATE_TIME_TS of the original file is not copied.

BulkLoadHFilesTool.class:
{code:java}
/**
 * Copy half of an HFile into a new HFile.
 */
private static void copyHFileHalf(Configuration conf, Path inFile, Path outFile,
  Reference reference, ColumnFamilyDescriptor familyDescriptor) throws IOException {
  FileSystem fs = inFile.getFileSystem(conf);
  CacheConfig cacheConf = CacheConfig.DISABLED;
  HalfStoreFileReader halfReader = null;
  StoreFileWriter halfWriter = null;
  try {
    ...
    HFileContext hFileContext = new HFileContextBuilder().withCompression(compression)
      .withChecksumType(StoreUtils.getChecksumType(conf))
      .withBytesPerCheckSum(StoreUtils.getBytesPerChecksum(conf)).withBlockSize(blocksize)
      .withDataBlockEncoding(familyDescriptor.getDataBlockEncoding()).withIncludesTags(true)
      .build(); // TODO .withCreateTime(EnvironmentEdgeManager.currentTime())

    halfWriter = new StoreFileWriter.Builder(conf, cacheConf, fs).withFilePath(outFile)
      .withBloomType(bloomFilterType).withFileContext(hFileContext).build();

    HFileScanner scanner = halfReader.getScanner(false, false, false);
    scanner.seekTo();
    do {
      halfWriter.append(scanner.getCell());
    } while (scanner.next());
    for (Map.Entry entry : fileInfo.entrySet()) {
      if (shouldCopyHFileMetaKey(entry.getKey())) {
        halfWriter.appendFileInfo(entry.getKey(), entry.getValue());
      }
    }
  } finally {
    ...
  }
}
{code}

This affects the lastMajorCompactionTs metric, which is derived from the file create time:
{code:java}
lastMajorCompactionTs = this.region.getOldestHfileTs(true);
...
long now = EnvironmentEdgeManager.currentTime();
return now - lastMajorCompactionTs;
...
{code}

{code:java}
public long getOldestHfileTs(boolean majorCompactionOnly) throws IOException {
  long result = Long.MAX_VALUE;
  for (HStore store : stores.values()) {
    Collection storeFiles = store.getStorefiles();
    ...
    for (HStoreFile file : storeFiles) {
      StoreFileReader sfReader = file.getReader();
      ...
      result = Math.min(result, reader.getFileContext().getFileCreateTime());
    }
  }
  return result == Long.MAX_VALUE ? 0 : result;
}
{code}
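The metric impact of a zero CREATE_TIME_TS follows directly from the min-over-files logic above. A self-contained sketch (the helper below is a simplified, hypothetical mirror of getOldestHfileTs, reduced to an array of timestamps): one split-produced file with create time 0 drags the minimum to 0, so "time since last major compaction" becomes the entire epoch timestamp.

```java
public class CreateTimeMetricSketch {
    // Simplified, hypothetical mirror of getOldestHfileTs: min over file create times.
    static long oldestHfileTs(long[] fileCreateTimes) {
        long result = Long.MAX_VALUE;
        for (long ts : fileCreateTimes) {
            result = Math.min(result, ts);
        }
        return result == Long.MAX_VALUE ? 0 : result;
    }

    public static void main(String[] args) {
        long now = 1_700_000_000_000L;   // fixed "current time" for reproducibility

        // All files carry a real CREATE_TIME_TS: the age metric is sane.
        long healthy = now - oldestHfileTs(new long[] { now - 60_000, now - 120_000 });

        // One half-file written without withCreateTime(): CREATE_TIME_TS defaults
        // to 0, so the reported age explodes to roughly the full epoch time.
        long broken = now - oldestHfileTs(new long[] { now - 60_000, 0L });

        System.out.println("healthy age = " + healthy);  // 120000 (2 minutes)
        System.out.println("broken age  = " + broken);   // 1700000000000 (~53 years)
    }
}
```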
[jira] [Created] (HBASE-27670) The FSDataOutputStream is obtained without reflection mode
alan.zhao created HBASE-27670:
----------------------------------

             Summary: The FSDataOutputStream is obtained without reflection mode
                 Key: HBASE-27670
                 URL: https://issues.apache.org/jira/browse/HBASE-27670
             Project: HBase
          Issue Type: Improvement
         Environment: HBase version: 2.2.3
            Reporter: alan.zhao

HBase interacts with HDFS and obtains an FSDataOutputStream to generate HFiles. In order to support favoredNodes, reflection is used. DistributedFileSystem has a more direct way to get the FSDataOutputStream, for example: dfs.createFile(path).permission(perm).create()...; this builder API allows you to set various parameters, including favoredNodes. I think avoiding reflection can improve performance, and if you agree with me, I can optimize this part of the code.

Module: hbase-server
Class: FSUtils
{code:java}
public static FSDataOutputStream create(Configuration conf, FileSystem fs, Path path,
  FsPermission perm, InetSocketAddress[] favoredNodes) throws IOException {
  if (fs instanceof HFileSystem) {
    FileSystem backingFs = ((HFileSystem) fs).getBackingFs();
    if (backingFs instanceof DistributedFileSystem) {
      // Try to use the favoredNodes version via reflection to allow backwards-
      // compatibility.
      short replication = Short.parseShort(conf.get(ColumnFamilyDescriptorBuilder.DFS_REPLICATION,
        String.valueOf(ColumnFamilyDescriptorBuilder.DEFAULT_DFS_REPLICATION)));
      try {
        return (FSDataOutputStream) (DistributedFileSystem.class
          .getDeclaredMethod("create", Path.class, FsPermission.class, boolean.class, int.class,
            short.class, long.class, Progressable.class, InetSocketAddress[].class)
          .invoke(backingFs, path, perm, true, CommonFSUtils.getDefaultBufferSize(backingFs),
            replication > 0 ? replication : CommonFSUtils.getDefaultReplication(backingFs, path),
            CommonFSUtils.getDefaultBlockSize(backingFs, path), null, favoredNodes));
{code}
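The core of the proposal is replacing a reflective Method lookup-and-invoke with an ordinary, compile-time-checked call. The contrast can be shown without HDFS; this is a minimal sketch on a local method (the `create` helper below is hypothetical and only illustrates the two call styles, not the FSUtils signature).

```java
import java.lang.reflect.Method;

public class ReflectionVsDirect {
    // Hypothetical stand-in for the method being invoked.
    public static int create(int x) {
        return x + 1;
    }

    public static void main(String[] args) throws Exception {
        // Reflective path, as in the current FSUtils.create(): look up the Method
        // by name and parameter types, then invoke with boxed arguments. A typo
        // in the name or signature only surfaces at runtime (NoSuchMethodException).
        Method m = ReflectionVsDirect.class.getDeclaredMethod("create", int.class);
        int viaReflection = (Integer) m.invoke(null, 41);

        // Direct path, as a builder-style API would allow: a plain call, checked
        // by the compiler, with no boxing or reflective dispatch per invocation.
        int direct = create(41);

        System.out.println(viaReflection + " " + direct);
    }
}
```

Beyond any per-call overhead, the main practical win of the direct path is that the signature is verified at compile time instead of failing at runtime if the HDFS API changes.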