[ https://issues.apache.org/jira/browse/AVRO-3482?focusedWorklogId=762210&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-762210 ]
ASF GitHub Bot logged work on AVRO-3482: ---------------------------------------- Author: ASF GitHub Bot Created on: 26/Apr/22 10:14 Start Date: 26/Apr/22 10:14 Worklog Time Spent: 10m Work Description: rbalamohan commented on code in PR #1639: URL: https://github.com/apache/avro/pull/1639#discussion_r858537526 ########## lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java: ########## @@ -97,18 +97,30 @@ protected DataFileStream(DatumReader<D> reader) throws IOException { this.reader = reader; } - /** Initialize the stream by reading from its head. */ - void initialize(InputStream in) throws IOException { - this.header = new Header(); - this.vin = DecoderFactory.get().binaryDecoder(in, vin); + byte[] readMagic() throws IOException { + if (this.vin == null) { + throw new IOException("InputStream is not initialized"); + } byte[] magic = new byte[DataFileConstants.MAGIC.length]; try { vin.readFixed(magic); // read magic } catch (IOException e) { throw new IOException("Not an Avro data file.", e); Review Comment: Filename isn't available. Null check has been added as a safeguard only. Issue Time Tracking ------------------- Worklog Id: (was: 762210) Time Spent: 2h 50m (was: 2h 40m) > DataFileReader should reuse MAGIC data read from inputstream > ------------------------------------------------------------ > > Key: AVRO-3482 > URL: https://issues.apache.org/jira/browse/AVRO-3482 > Project: Apache Avro > Issue Type: Bug > Reporter: Rajesh Balamohan > Priority: Major > Labels: performance, pull-request-available > Time Spent: 2h 50m > Remaining Estimate: 0h > > [https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L60-L72] > > {code} > byte[] magic = new byte[MAGIC.length]; > in.seek(0); > int offset = 0; > int length = magic.length; > while (length > 0) { > int bytesRead = in.read(magic, offset, length); > if (bytesRead < 0) > throw new EOFException("Unexpected EOF with " + length + " bytes > remaining to read"); > length -= bytesRead; > offset += bytesRead; > } > in.seek(0); <--- This will force the inputstream to switch to "random" io > policy in next read in cloud connectors! > if (Arrays.equals(MAGIC, magic)) // current format > return new DataFileReader<>(in, reader); > if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format > return new DataFileReader12<>(in, reader); > > {code} > > With cloud stores, this can turn out to be expensive as the stream has to be > closed and reopened in cloud connectors (e.g s3). > It will be helpful to reuse the MAGIC bytes read from inputstream and pass it > on to DataFileReader / DataFileReader12. This will ensure that, file can be > read in sequential manner in cloud stores and help in reducing IO calls. -- This message was sent by Atlassian Jira (v8.20.7#820007)