[ 
https://issues.apache.org/jira/browse/AVRO-3482?focusedWorklogId=762210&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-762210
 ]

ASF GitHub Bot logged work on AVRO-3482:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 26/Apr/22 10:14
            Start Date: 26/Apr/22 10:14
    Worklog Time Spent: 10m 
      Work Description: rbalamohan commented on code in PR #1639:
URL: https://github.com/apache/avro/pull/1639#discussion_r858537526


##########
lang/java/avro/src/main/java/org/apache/avro/file/DataFileStream.java:
##########
@@ -97,18 +97,30 @@ protected DataFileStream(DatumReader<D> reader) throws 
IOException {
     this.reader = reader;
   }
 
-  /** Initialize the stream by reading from its head. */
-  void initialize(InputStream in) throws IOException {
-    this.header = new Header();
-    this.vin = DecoderFactory.get().binaryDecoder(in, vin);
+  byte[] readMagic() throws IOException {
+    if (this.vin == null) {
+      throw new IOException("InputStream is not initialized");
+    }
     byte[] magic = new byte[DataFileConstants.MAGIC.length];
     try {
       vin.readFixed(magic); // read magic
     } catch (IOException e) {
       throw new IOException("Not an Avro data file.", e);

Review Comment:
   Filename isn't available. Null check has been added as a safeguard only.





Issue Time Tracking
-------------------

    Worklog Id:     (was: 762210)
    Time Spent: 2h 50m  (was: 2h 40m)

> DataFileReader should reuse MAGIC data read from inputstream
> ------------------------------------------------------------
>
>                 Key: AVRO-3482
>                 URL: https://issues.apache.org/jira/browse/AVRO-3482
>             Project: Apache Avro
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: performance, pull-request-available
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L60-L72]
>  
> {code}
> byte[] magic = new byte[MAGIC.length];
>     in.seek(0);
>     int offset = 0;
>     int length = magic.length;
>     while (length > 0) {
>       int bytesRead = in.read(magic, offset, length);
>       if (bytesRead < 0)
>         throw new EOFException("Unexpected EOF with " + length + " bytes 
> remaining to read");
>       length -= bytesRead;
>       offset += bytesRead;
>     }
>     in.seek(0); <--- This will force the inputstream to switch to "random" io 
> policy in next read in cloud connectors!
>     if (Arrays.equals(MAGIC, magic)) // current format
>       return new DataFileReader<>(in, reader);
>     if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format
>       return new DataFileReader12<>(in, reader);
>  
> {code}
>  
> With cloud stores, this can turn out to be expensive as the stream has to be 
> closed and reopened in cloud connectors (e.g s3).
> It will be helpful to reuse the MAGIC bytes read from inputstream and pass it 
> on to DataFileReader / DataFileReader12. This will ensure that, file can be 
> read in sequential manner in cloud stores and help in reducing IO calls.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to