martin-traverse commented on issue #794:
URL: https://github.com/apache/arrow-java/issues/794#issuecomment-3047867486

   Hm - I have found ArrowReader / ArrowWriter to be a bit opinionated. This is 
particularly an issue with the reader, because you have less control (over when 
bytes are going to arrive), but the writer also makes a lot of assumptions, e.g. 
dictionaries are static, no deltas etc. Neither gives you much control over what 
is happening at the file level. I ended up overriding and shading some of the 
supporting classes in order to use them (I have patches to submit which could 
reduce the need for this).
   
   I think the goal with ArrowReader / ArrowWriter was to abstract away the 
different formats (file and stream), so perhaps that abstraction is partly what 
causes the loss of control. My preference would be for a more explicit API that 
maps directly to the file structure; it is always possible to layer 
generalisation on top for a specific pattern.
   
   Here is a quick summary of what is the same / different:
   
   Writer:
   
       // These can be the same as ArrowWriter
       void writeBatch();
       long bytesWritten();
       void close();
   
       // These are different
       // Explicit control instead of start() / end(), which may or may not
       // trigger writes
       void writeHeader();
       void resetBatch(VectorSchemaRoot batch);  // Allow streaming pattern
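
   To make the streaming pattern concrete, here is a minimal sketch of how the 
writer methods listed above might be driven. It is not Arrow code: the 
`BatchWriter` interface and `InMemoryWriter` class are hypothetical, `int[]` 
stands in for VectorSchemaRoot, and the byte layout (a 4-byte "DEMO" marker 
followed by length-prefixed batches) is invented purely for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class WriterSketch {

    /** Hypothetical interface mirroring the writer method list; not an Arrow API. */
    interface BatchWriter extends AutoCloseable {
        void writeHeader();
        void resetBatch(int[] batch);   // int[] stands in for VectorSchemaRoot
        void writeBatch();
        long bytesWritten();
        @Override void close();
    }

    /** Toy in-memory implementation; the byte layout is invented, not the Arrow format. */
    static final class InMemoryWriter implements BatchWriter {
        private final ByteArrayOutputStream out = new ByteArrayOutputStream();
        private int[] current;
        private boolean headerWritten;

        @Override public void writeHeader() {
            if (headerWritten) throw new IllegalStateException("header already written");
            byte[] magic = "DEMO".getBytes(StandardCharsets.US_ASCII);
            out.write(magic, 0, magic.length);
            headerWritten = true;
        }

        // Swap in the next batch without recreating the writer.
        @Override public void resetBatch(int[] batch) { current = batch; }

        @Override public void writeBatch() {
            if (!headerWritten) throw new IllegalStateException("call writeHeader() first");
            if (current == null) throw new IllegalStateException("no batch set");
            // Length prefix (4 bytes) followed by the values (4 bytes each).
            ByteBuffer buf = ByteBuffer.allocate(4 + 4 * current.length);
            buf.putInt(current.length);
            for (int v : current) buf.putInt(v);
            out.write(buf.array(), 0, buf.capacity());
        }

        @Override public long bytesWritten() { return out.size(); }
        @Override public void close() { /* nothing to release in this toy version */ }
    }

    public static void main(String[] args) {
        try (InMemoryWriter writer = new InMemoryWriter()) {
            writer.writeHeader();                       // explicit, happens exactly once
            for (int[] batch : new int[][] {{1, 2, 3}, {4, 5}}) {
                writer.resetBatch(batch);               // point the writer at new data
                writer.writeBatch();                    // the only call that emits a batch
            }
            System.out.println("bytes written: " + writer.bytesWritten());
        }
    }
}
```

   The point of the sketch is only that every write is triggered by an explicit 
call, so the caller always knows when bytes are produced.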
   
   Reader:
   
       // These are the same as ArrowReader:
       VectorSchemaRoot getVectorSchemaRoot();
       long bytesRead();
       void close();
   
       // These are also the same, from DictionaryProvider
       Set<Long> getDictionaryIds();
       Dictionary lookup(long id);
   
       // These are different - Arrow has readSchema() and initialize(), which
       // may or may not trigger reads
       void readHeader();   // Explicit read control
       Schema getSchema();  // Explicit get - does not trigger reading
   
       // These are also different - explicit control over reading
       boolean readBatch();
       boolean hasNextBatch();
       long nextBatchPosition();
       long nextBatchSize();
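
   As a rough illustration of the explicit read control described above, the 
sketch below walks a buffer batch by batch. Everything in it is an assumption 
made for the example: `BatchReader` and `InMemoryReader` are hypothetical, 
`String` stands in for Schema, `getBatch()` stands in for 
getVectorSchemaRoot(), the dictionary methods are omitted, and `readBatch()` 
is assumed to return false when no batch is available.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ReaderSketch {

    /** Hypothetical interface mirroring the reader method list; not an Arrow API. */
    interface BatchReader {
        void readHeader();
        String getSchema();         // String stands in for Schema
        boolean hasNextBatch();
        long nextBatchPosition();
        long nextBatchSize();
        boolean readBatch();
        int[] getBatch();           // stands in for getVectorSchemaRoot()
        long bytesRead();
    }

    /** Toy reader over an invented layout: 4-byte marker, then length-prefixed int batches. */
    static final class InMemoryReader implements BatchReader {
        private final ByteBuffer in;
        private String schema;
        private int[] current = new int[0];

        InMemoryReader(byte[] data) { this.in = ByteBuffer.wrap(data); }

        @Override public void readHeader() {
            byte[] magic = new byte[4];
            in.get(magic);
            schema = new String(magic, StandardCharsets.US_ASCII);
        }

        // Pure getter: never touches the underlying buffer.
        @Override public String getSchema() {
            if (schema == null) throw new IllegalStateException("call readHeader() first");
            return schema;
        }

        @Override public boolean hasNextBatch() { return schema != null && in.remaining() >= 4; }
        @Override public long nextBatchPosition() { return in.position(); }

        // Peeks at the length prefix without advancing the read position.
        @Override public long nextBatchSize() { return 4 + 4L * in.getInt(in.position()); }

        @Override public boolean readBatch() {
            if (!hasNextBatch()) return false;
            current = new int[in.getInt()];
            for (int i = 0; i < current.length; i++) current[i] = in.getInt();
            return true;
        }

        @Override public int[] getBatch() { return current; }
        @Override public long bytesRead() { return in.position(); }
    }

    /** Builds a sample buffer holding the marker and two batches: {1, 2, 3} and {4, 5}. */
    static byte[] sampleData() {
        ByteBuffer buf = ByteBuffer.allocate(32);
        buf.put("DEMO".getBytes(StandardCharsets.US_ASCII));
        buf.putInt(3).putInt(1).putInt(2).putInt(3);
        buf.putInt(2).putInt(4).putInt(5);
        return buf.array();
    }

    public static void main(String[] args) {
        InMemoryReader reader = new InMemoryReader(sampleData());
        reader.readHeader();                            // the only call that reads the header
        while (reader.hasNextBatch()) {
            System.out.println("next batch at " + reader.nextBatchPosition()
                    + ", size " + reader.nextBatchSize());
            reader.readBatch();                         // the only call that reads a batch
            System.out.println("values read: " + reader.getBatch().length);
        }
        System.out.println("bytes read: " + reader.bytesRead());
    }
}
```

   Because the position and size of the next batch are visible before 
`readBatch()` is called, a caller could wait until enough bytes have arrived 
before triggering the read.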
   
   One minor detail on naming - ArrowReader has `loadNextBatch()` which is 
possibly more descriptive, but I think there is value in sticking with just one 
scheme - load / save, read / write, get / put etc. - otherwise it gets confusing.
   
   Happy to take a steer if you feel differently on any of this. Otherwise if 
you are happy lmk and I'll make a start :)


