[jira] [Assigned] (AVRO-3482) DataFileReader should reuse MAGIC data read from inputstream

2022-04-29 Thread Thiruvalluvan M. G. (Jira)


 [ 
https://issues.apache.org/jira/browse/AVRO-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thiruvalluvan M. G. reassigned AVRO-3482:
-

Assignee: Rajesh Balamohan  (was: Thiruvalluvan M. G.)

> DataFileReader should reuse MAGIC data read from inputstream
> 
>
> Key: AVRO-3482
> URL: https://issues.apache.org/jira/browse/AVRO-3482
> Project: Apache Avro
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Priority: Major
> Labels: performance, pull-request-available
> Fix For: 1.12.0
>
> Time Spent: 3.5h
> Remaining Estimate: 0h
>
> [https://github.com/apache/avro/blob/master/lang/java/avro/src/main/java/org/apache/avro/file/DataFileReader.java#L60-L72]
>  
> {code}
> byte[] magic = new byte[MAGIC.length];
> in.seek(0);
> int offset = 0;
> int length = magic.length;
> while (length > 0) {
>   int bytesRead = in.read(magic, offset, length);
>   if (bytesRead < 0)
>     throw new EOFException("Unexpected EOF with " + length + " bytes remaining to read");
>   length -= bytesRead;
>   offset += bytesRead;
> }
> // <--- The seek below will force the input stream to switch to the "random" IO
> // policy on the next read in cloud connectors!
> in.seek(0);
> if (Arrays.equals(MAGIC, magic)) // current format
>   return new DataFileReader<>(in, reader);
> if (Arrays.equals(DataFileReader12.MAGIC, magic)) // 1.2 format
>   return new DataFileReader12<>(in, reader);
>  
> {code}
>  
> With cloud stores, this can be expensive, because cloud connectors (e.g. S3)
> have to close and reopen the stream.
> It would be helpful to reuse the MAGIC bytes already read from the input stream
> and pass them on to DataFileReader / DataFileReader12. This would ensure that
> the file can be read sequentially in cloud stores and would reduce the number
> of IO calls.
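
The idea in the description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Avro patch: the class MagicReuseSketch, the holder type Opened, and the methods readMagic and open are hypothetical names, the two marker arrays merely stand in for MAGIC and DataFileReader12.MAGIC, and a plain java.io.InputStream replaces Avro's seekable input. The point is that the magic bytes are read once and handed on together with the stream, so no seek back to offset 0 is needed.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical sketch: detect the file format from the magic bytes and
// keep reading sequentially instead of seeking back to the start.
final class MagicReuseSketch {
  // Illustrative markers standing in for MAGIC and DataFileReader12.MAGIC (assumed values).
  static final byte[] MAGIC = {'O', 'b', 'j', 1};
  static final byte[] MAGIC_12 = {'O', 'b', 'j', 0};

  // Hypothetical holder: the detected format, the magic bytes that were already
  // consumed, and the stream positioned just past them.
  static final class Opened {
    final String format;
    final byte[] magic;
    final InputStream body;

    Opened(String format, byte[] magic, InputStream body) {
      this.format = format;
      this.magic = magic;
      this.body = body;
    }
  }

  // Reads exactly n bytes, looping because read() may return fewer than requested.
  static byte[] readMagic(InputStream in, int n) throws IOException {
    byte[] magic = new byte[n];
    int offset = 0;
    while (offset < n) {
      int bytesRead = in.read(magic, offset, n - offset);
      if (bytesRead < 0)
        throw new EOFException("Unexpected EOF with " + (n - offset) + " bytes remaining to read");
      offset += bytesRead;
    }
    return magic;
  }

  // Dispatches on the magic bytes and passes them along; the stream is never rewound.
  static Opened open(InputStream in) throws IOException {
    byte[] magic = readMagic(in, MAGIC.length);
    if (Arrays.equals(MAGIC, magic)) // current format
      return new Opened("current", magic, in);
    if (Arrays.equals(MAGIC_12, magic)) // 1.2 format
      return new Opened("1.2", magic, in);
    throw new IOException("Not a recognized data file");
  }

  public static void main(String[] args) throws IOException {
    byte[] file = {'O', 'b', 'j', 1, 42};
    Opened opened = open(new ByteArrayInputStream(file));
    // prints "current, next byte = 42"
    System.out.println(opened.format + ", next byte = " + opened.body.read());
  }
}
```

A reader built this way would validate the magic it is handed rather than re-reading it, which is what keeps the access pattern sequential.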



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (AVRO-3482) DataFileReader should reuse MAGIC data read from inputstream

2022-04-27 Thread Martin Tzvetanov Grigorov (Jira)


 [ 
https://issues.apache.org/jira/browse/AVRO-3482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Tzvetanov Grigorov reassigned AVRO-3482:
---

Assignee: Thiruvalluvan M. G.

> DataFileReader should reuse MAGIC data read from inputstream
> 
>
> Key: AVRO-3482
> URL: https://issues.apache.org/jira/browse/AVRO-3482
> Project: Apache Avro
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Thiruvalluvan M. G.
> Priority: Major
> Labels: performance, pull-request-available
> Fix For: 1.12.0
>
> Time Spent: 3.5h
> Remaining Estimate: 0h
>


