martin-traverse commented on issue #794: URL: https://github.com/apache/arrow-java/issues/794#issuecomment-3047867486
Hm - I have found ArrowReader / ArrowWriter to be a bit opinionated. This is particularly an issue with the reader, because you have less control (over when bytes are going to arrive), but the writer also makes a lot of assumptions, e.g. dictionaries are static, no deltas etc. Neither gives you much control of what is happening at the file level. I ended up overriding and shading some of the supporting classes in order to use them (I have patches to submit which could reduce the need for this).

I think the goal with ArrowReader / ArrowWriter was to abstract away the different formats (file and stream), so perhaps that abstraction is partly what causes the loss of control. My preference would be for a more explicit API that directly maps to the file structure; it is always possible to layer generalisation on top for a specific pattern.

Here is a quick summary of what is the same / different:

Writer:

    // These can be the same as ArrowWriter
    void writeBatch();
    long bytesWritten();
    void close();

    // These are different
    void writeHeader();                       // Explicit control instead of start() / end(), which may or may not trigger writes
    void resetBatch(VectorSchemaRoot batch);  // Allow streaming pattern

Reader:

    // These are the same as ArrowReader
    VectorSchemaRoot getVectorSchemaRoot();
    long bytesRead();
    void close();

    // These are also the same, from DictionaryProvider
    Set<Long> getDictionaryIds();
    Dictionary lookup(long id);

    // These are different - ArrowReader has readSchema() and initialize(), which may or may not trigger reads
    void readHeader();    // Explicit read control
    Schema getSchema();   // Explicit get - does not trigger reading

    // These are also different - explicit control over reading
    boolean readBatch();
    boolean hasNextBatch();
    long nextBatchPosition();
    long nextBatchSize();

One minor detail on naming - ArrowReader has `loadNextBatch()` which is possibly more descriptive, but I think there is value in sticking with just one scheme - load / save, read / write, get / put etc.,
otherwise it gets confusing. Happy to take a steer if you feel differently on any of this. Otherwise if you are happy lmk and I'll make a start :)
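
For illustration, the reader half of the summary above might be sketched roughly as follows. This is only a sketch under stated assumptions, not the proposed implementation: the type names `BatchReader` and `InMemoryReader` are hypothetical, a `String` stands in for `Schema`, raw `byte[]` payloads stand in for record batches, the 8-byte header size is made up, and having `getSchema()` throw before `readHeader()` has been called is one possible choice rather than anything settled here. The dictionary methods are left out for brevity and no Arrow classes are used.

```java
import java.util.List;

class ExplicitReaderSketch {

    // Hypothetical interface; method names follow the summary above.
    interface BatchReader extends AutoCloseable {
        void readHeader();         // explicit read of the header
        String getSchema();        // plain getter, never triggers a read
        boolean hasNextBatch();
        long nextBatchPosition();  // offset at which the next batch starts
        long nextBatchSize();      // size of the next batch in bytes, 0 if none
        boolean readBatch();       // false once there is nothing left to read
        long bytesRead();
        @Override void close();
    }

    // Toy in-memory implementation, used only to show the call pattern.
    static final class InMemoryReader implements BatchReader {
        private static final long HEADER_SIZE = 8; // assumed toy header size
        private final String schema;
        private final List<byte[]> batches;
        private boolean headerRead = false;
        private int next = 0;
        private long position = 0;

        InMemoryReader(String schema, List<byte[]> batches) {
            this.schema = schema;
            this.batches = batches;
        }

        @Override public void readHeader() {
            if (!headerRead) {
                position = HEADER_SIZE;
                headerRead = true;
            }
        }

        @Override public String getSchema() {
            // Assumption: fail loudly instead of reading implicitly.
            if (!headerRead) {
                throw new IllegalStateException("readHeader() has not been called");
            }
            return schema;
        }

        @Override public boolean hasNextBatch() {
            return headerRead && next < batches.size();
        }

        @Override public long nextBatchPosition() { return position; }

        @Override public long nextBatchSize() {
            return hasNextBatch() ? batches.get(next).length : 0;
        }

        @Override public boolean readBatch() {
            if (!hasNextBatch()) {
                return false;
            }
            position += batches.get(next).length;
            next++;
            return true;
        }

        @Override public long bytesRead() { return position; }

        @Override public void close() { next = batches.size(); }
    }

    public static void main(String[] args) {
        List<byte[]> data = List.of(new byte[16], new byte[32]);
        try (InMemoryReader reader = new InMemoryReader("id: int64", data)) {
            reader.readHeader(); // the caller decides when the header is read
            System.out.println("schema = " + reader.getSchema());
            while (reader.hasNextBatch()) {
                long pos = reader.nextBatchPosition();
                long size = reader.nextBatchSize();
                reader.readBatch();
                System.out.println("batch at " + pos + ", " + size + " bytes");
            }
            System.out.println("bytesRead = " + reader.bytesRead());
        }
    }
}
```

The point of the sketch is that every read is an explicit call, so the caller can inspect `nextBatchPosition()` / `nextBatchSize()` before deciding whether the bytes for the next batch are available.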