Hi Apex Dev community,

Please provide feedback on the S3RecordReader design below.

The design merges two approaches: S3BlockReader's, for reading blocks
from S3, and FSRecordReader's, for emitting records from those blocks.

*S3RecordReader - for Delimited Records*
FSRecordReader uses LookAheadLineReaderContext to read from the file
system and then searches for the newline character to identify the end of a record.
*Existing code snippet* (ReaderContext.java):

readEntity()
{
  ...
  stream.read(offset + usedBytes, buffer, 0, bufferSize);
  ...
}
The above code reads bufferSize bytes starting at offset + usedBytes
and stores them in buffer. The buffer here represents the block read from
the file system.

I intend to make the following changes to LookAheadLineReaderContext:

1) Add a method readData() as follows:

protected int readData(long usedBytes) throws IOException
{
  return stream.read(offset + usedBytes, buffer, 0, bufferSize);
}
readData() reads a block and returns the number of bytes read
(-1 if no bytes were read).

2) Use the above readData() method in readEntity() of
LookAheadLineReaderContext.


For the S3RecordReader class, to parse delimited records, we will use an
S3DelimitedRecordReaderContext, which inherits from
LookAheadLineReaderContext.
We will override the readData() method in S3DelimitedRecordReaderContext
so that it fetches a block of data from S3, similar to how
a block is fetched in S3BlockReader.
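To make the override concrete, here is a minimal sketch of the readData() contract, with the S3 ranged GET replaced by an in-memory byte array. In the real reader this would be a ranged request against the S3 object, as S3BlockReader does; the class and field names below are illustrative, not the actual operator code.

```java
public class S3ReadSketch {
  // Stands in for the bytes of the S3 object; hypothetical test data.
  static byte[] s3Object = "record1\nrecord2\nrecord3\n".getBytes();

  static long offset = 0;      // block start offset within the file
  static int bufferSize = 10;  // bytes fetched per call
  static byte[] buffer = new byte[bufferSize];

  /**
   * Mirrors the proposed readData(usedBytes): fetch up to bufferSize bytes
   * starting at offset + usedBytes into buffer; return the number of bytes
   * read, or -1 when past the end of the object.
   */
  static int readData(long usedBytes) {
    long start = offset + usedBytes;
    if (start >= s3Object.length) {
      return -1;
    }
    int n = (int) Math.min(bufferSize, s3Object.length - start);
    System.arraycopy(s3Object, (int) start, buffer, 0, n);
    return n;
  }

  public static void main(String[] args) {
    System.out.println(readData(0));   // full buffer: 10
    System.out.println(readData(20));  // tail of object: 4
    System.out.println(readData(24));  // past the end: -1
  }
}
```

The point of isolating readData() is exactly this substitution: readEntity()'s newline-scanning logic stays untouched while the byte source changes from a file-system stream to an S3 ranged read.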

*S3RecordReader - for Fixed-Length Records*
In the existing FSRecordReader, if the record length is L, we read L
bytes from the stream and emit the record; there is no concept of
reading a block.
For S3, this approach is not feasible because the data resides in the cloud.
Hence, in this case, we can use an approach similar to the one used for
delimited records:

1) Fetch a block from S3.
2) If the entire record lies within the current block, read and return the
record.
3) If a record spans multiple S3 blocks, read the partial record from
block 1 and the remainder from block 2.

When reading the new block (block 2), we will initialize the offset to
point to the start of the record within block 2. This can be computed
easily using the modulus operator (%).


*One concern with the above two approaches:*
When a record R spans multiple S3 blocks (say B1 and B2), we will have to
fetch block B2 twice: once to emit record R, and again to emit the records
in B2 that follow R.
The existing FSRecordReader performs two reads of B2 from the file system
in the same case.

For fixed-length record reading, we can suggest in the documentation that
users set the block length to a multiple of the record length, which avoids
the above issue.
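A quick sanity check of that suggestion (the helper name is illustrative): when the block length is an exact multiple of the record length, every block boundary falls on a record boundary, so no record can straddle two blocks and the double fetch never occurs:

```java
public class AlignedBlocks {
  /**
   * True iff no fixed-length record can straddle a block boundary:
   * block boundaries sit at multiples of blockSize, and when blockSize
   * is a multiple of recordLength every such boundary is also a record
   * boundary.
   */
  static boolean noRecordSpansBlocks(int blockSize, int recordLength) {
    return blockSize % recordLength == 0;
  }

  public static void main(String[] args) {
    System.out.println(noRecordSpansBlocks(1024, 128)); // true
    System.out.println(noRecordSpansBlocks(1000, 7));   // false
  }
}
```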


Regards,
Ajay
