Github user kmader commented on the pull request: https://github.com/apache/spark/pull/1658#issuecomment-52219293

Addressing the major issues brought up:

Do we need both a stream API and a byte-array one? The byte-array API is more likely to run out of memory on large files, but the stream API can have issues if the streams are serialized and shuffled before being read. For reading TIFF/JPG images a byte array is sufficient input, and I expect there are other use cases as well.

For the file closing, I believe I can avoid rewriting too much code by simply extending the ```close``` method of my StreamBasedRecordReader object to try again to close the stream (a rough sketch follows). NewHadoopRDD already calls this method through your executeOnCompleteCallbacks.

I would like to leave the abstract classes BinaryRecordReader and StreamFileInputFormat public, since otherwise all implementations would have to reside in org.apache.spark, which is inconvenient for external packages, but I will make the others private.

As for saving, I already have code from my own tools, but I would prefer to finish this pull request for input and then open a separate PR for saving, since it is a different beast. I have created simple test cases and added them to FileSuite and JavaAPISuite respectively.
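For reference, a minimal sketch of what such an idempotent ```close``` could look like; the class shape, field name ```fileIn```, and type parameter are illustrative assumptions here, not the PR's actual code:

```scala
import java.io.{DataInputStream, IOException}

import org.apache.hadoop.mapreduce.RecordReader

// Illustrative sketch only: a record reader whose close() tolerates a
// stream that may already have been closed by the consumer of the record.
abstract class StreamBasedRecordReader[T] extends RecordReader[String, T] {

  // The open stream for the current split; may be null, or already closed
  // if the user consumed it directly.
  protected var fileIn: DataInputStream = _

  override def close(): Unit = {
    if (fileIn != null) {
      try {
        // Try again to close the stream; the Closeable contract says a
        // second close() has no effect, and the catch below covers
        // implementations that throw anyway.
        fileIn.close()
      } catch {
        case _: IOException => // already closed elsewhere; nothing to do
      } finally {
        fileIn = null
      }
    }
  }
}
```

Since NewHadoopRDD invokes ```close``` via the on-complete callbacks regardless of what the user did with the stream, making it safe to call twice avoids touching the RDD code at all.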