[ https://issues.apache.org/jira/browse/AVRO-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529213#comment-17529213 ]
Martin Tzvetanov Grigorov commented on AVRO-3482:
-------------------------------------------------

Does this improvement need to be backported to branch-1.11?

> DataFileReader should reuse MAGIC data read from inputstream
> ------------------------------------------------------------
>
>                 Key: AVRO-3482
>                 URL: https://issues.apache.org/jira/browse/AVRO-3482
>             Project: Apache Avro
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Thiruvalluvan M. G.
>            Priority: Major
>              Labels: performance, pull-request-available
>             Fix For: 1.12.0
>
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> [https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L60-L72]
>
> {code}
>     byte[] magic = new byte[MAGIC.length];
>     in.seek(0);
>     int offset = 0;
>     int length = magic.length;
>     while (length > 0) {
>       int bytesRead = in.read(magic, offset, length);
>       if (bytesRead < 0)
>         throw new EOFException("Unexpected EOF with " + length + " bytes remaining to read");
>       length -= bytesRead;
>       offset += bytesRead;
>     }
>     in.seek(0); // <--- forces the input stream to switch to the "random" IO policy on the next read in cloud connectors!
>     if (Arrays.equals(MAGIC, magic)) // current format
>       return new DataFileReader<>(in, reader);
>     if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format
>       return new DataFileReader12<>(in, reader);
> {code}
>
> With cloud stores, this second seek can be expensive, because cloud connectors
> (e.g. S3) have to close and reopen the stream.
> It would help to reuse the MAGIC bytes already read from the input stream and
> pass them on to DataFileReader / DataFileReader12. The file can then be read
> sequentially in cloud stores, which reduces the number of IO calls.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
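
The idea in the issue description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the change merged for AVRO-3482: MagicReuseSketch, SketchReader and open() are hypothetical names, a plain java.io.InputStream stands in for Avro's seekable input, and the two magic constants are illustrative values. The point is that the magic bytes are read once, the format is chosen from them, and the stream is handed on without seeking back to 0.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class MagicReuseSketch {
    // Illustrative magic values for the current and the 1.2 container formats (assumption).
    static final byte[] MAGIC = {'O', 'b', 'j', 1};
    static final byte[] MAGIC_12 = {'O', 'b', 'j', 0};

    /** Reads exactly MAGIC.length bytes from the current position; never seeks. */
    static byte[] readMagic(InputStream in) throws IOException {
        byte[] magic = new byte[MAGIC.length];
        int offset = 0;
        while (offset < magic.length) {
            int bytesRead = in.read(magic, offset, magic.length - offset);
            if (bytesRead < 0) {
                throw new EOFException(
                    "Unexpected EOF with " + (magic.length - offset) + " bytes remaining to read");
            }
            offset += bytesRead;
        }
        return magic;
    }

    /** Hypothetical reader that is handed a stream already positioned just past the magic. */
    static final class SketchReader {
        final String format;
        final InputStream body;

        SketchReader(String format, InputStream body) {
            this.format = format;
            this.body = body;
        }
    }

    /** Picks the reader from the magic bytes already consumed, without a second seek(0). */
    static SketchReader open(InputStream in) throws IOException {
        byte[] magic = readMagic(in);
        if (Arrays.equals(MAGIC, magic)) {
            return new SketchReader("current", in);
        }
        if (Arrays.equals(MAGIC_12, magic)) {
            return new SketchReader("1.2", in);
        }
        throw new IOException("Not a recognized data file");
    }

    public static void main(String[] args) throws IOException {
        byte[] file = {'O', 'b', 'j', 1, 42};
        SketchReader r = open(new ByteArrayInputStream(file));
        // The next read continues right after the magic, so access stays sequential.
        System.out.println(r.format + " next=" + r.body.read()); // prints "current next=42"
    }
}
```

Because the stream is never repositioned, a cloud connector would see one forward-only read pattern; the reader only has to know that the magic has already been consumed.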