GitHub user kmader commented on the pull request:

    https://github.com/apache/spark/pull/1658#issuecomment-52219293
  
    Addressing the major issues brought up:
    
    Do we need both a stream API and a byte-array one? The byte-array API is 
more prone to out-of-memory problems on large files, while the stream API can 
run into trouble if the streams are serialized and shuffled before being read. 
For reading tiff/jpg images a byte array is sufficient input (see the sketch 
below), though I imagine there are other use cases as well. 
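    For concreteness, a quick sketch of the byte-array case (```decodeImage``` 
is only an illustration here, not part of the PR): once the file content is a 
plain byte array, a jpg record can be decoded anywhere, with no open stream 
that has to survive serialization.

```scala
import java.io.ByteArrayInputStream
import java.awt.image.BufferedImage
import javax.imageio.ImageIO

// Decode an image record entirely from its bytes: the byte array is
// freely serializable, so nothing node-local (no open stream) is needed
// at the point where the record is finally read.
def decodeImage(fileName: String, bytes: Array[Byte]): (String, BufferedImage) = {
  val img = ImageIO.read(new ByteArrayInputStream(bytes))
  (fileName, img)
}
```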
    
    Finally, for the file closing: I believe I can avoid rewriting too much 
code by simply extending the ```close``` method of my StreamBasedRecordReader 
to try again to close the stream. NewHadoopRDD already calls this method 
through your executeOnCompleteCallbacks, as sketched below. 
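    A minimal sketch of that idea (the class name is mine; this is not the PR 
code): a ```close``` that is safe to call more than once, so the completion 
callback can retry it without error.

```scala
import java.io.DataInputStream
import org.apache.hadoop.mapreduce.RecordReader

// Sketch: an idempotent close(), so the cleanup that NewHadoopRDD runs
// through its completion callbacks can safely call it a second time.
abstract class RetryingCloseRecordReader[T] extends RecordReader[String, T] {
  protected var fileStream: DataInputStream = _

  override def close(): Unit = {
    if (fileStream != null) {
      try fileStream.close()
      finally fileStream = null // repeated calls become no-ops
    }
  }
}
```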
    
    I would like to leave the abstract classes BinaryRecordReader and 
StreamFileInputFormat public, since otherwise all of the implementations would 
have to reside in org.apache.spark, which is inconvenient for external 
packages (see the sketch below), but I will make the other classes private. 
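    To illustrate why the public visibility matters, this is the kind of 
subclass an external package would write. Everything here is hypothetical: the 
package name, the tiff reader, the class's final location, and the assumption 
that StreamFileInputFormat follows the Hadoop CombineFileInputFormat contract 
with String keys.

```scala
package com.example.imaging

import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.spark.input.StreamFileInputFormat // assumed location

// Only compiles outside Spark if the abstract class stays public.
class TiffInputFormat extends StreamFileInputFormat[Array[Byte]] {
  override def createRecordReader(
      split: InputSplit,
      context: TaskAttemptContext): RecordReader[String, Array[Byte]] =
    ??? // plug in a BinaryRecordReader subclass that parses tiff bytes
}
```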
    
    As for saving, I already have code from my own tools, but I would prefer 
to finish this pull request for input first and then open a separate PR for 
saving, since it is a different beast. 
    
    I have created simple test cases and added them to FileSuite and 
JavaAPISuite, respectively.
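    Roughly the shape of the FileSuite one (```sc``` and ```tempDir``` come 
from the suite itself, and ```binaryFiles``` stands in for whatever the input 
method ends up being called):

```scala
test("binary file input as byte arrays") {
  // Write a small binary file, read it back through the new API, and
  // check that the bytes round-trip unchanged.
  val outFile = new java.io.File(tempDir, "record.bin")
  val data = Array[Byte](1, 2, 3, 4)
  val out = new java.io.FileOutputStream(outFile)
  out.write(data)
  out.close()

  val (_, bytes) = sc.binaryFiles(outFile.getAbsolutePath).first()
  assert(bytes.toSeq === data.toSeq)
}
```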


