Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/19439

@hhbyyh thank you for bringing up these questions. In response:

> Does the current schema support or plan to support image feature data in Floats[] or Doubles[]?

It does, indirectly: this is what the `CV_32FXX` field types are for. You need to do some low-level casting to convert the byte array to an array of numbers, but that should not be a limitation for most applications.

> Correct me if I'm wrong, but I don't see ocvTypes playing any role in the code. If the field is for future extension, maybe it's better to keep only the supported types, not all the OpenCV types.

Indeed. These fields are added so that users know what values are allowed. A Scala-friendly choice would have been sealed traits or enumerations, but the consensus in Spark has been for a low-level representation. Nothing precludes adding a case class in the future to represent this dataset with more type-safety information.

> In most scenarios, deep learning applications use rescaled/cropped images (typically 256, 224 or smaller). Maybe add an extra parameter "smallSideSize" to the readImages method, which would be more convenient for users and would avoid caching the image at its original size (which could be 100 times larger than the scaled image). This could be done in a follow-up PR.

This is a good point that we have all hit. The issue is that there is no unique definition of rescaling (what interpolation? crop then scale? scale then crop?), and each library has made different choices. This would certainly be good material for a future PR.

> Not sure about the reason to include "origin" info in the image data. Based on my experience, path info serves better as a separate column in the DataFrame (e.g. for prediction).

Yes, this feature has been debated.
Some developers have had a compelling need to access information about the origin of an image directly inside the image struct.

> IMO the parameter "recursive" may not be necessary. Existing wildcard matching can provide more functionality.

Indeed, this feature may not seem that useful at first glance. For some Hadoop file systems, though, in which images are accessed in a batched manner, it is useful to traverse these batches; this matters for performance. This is why it is marked as experimental for the time being.

> For the Scala API, ImageSchema should be in a separate file and not mixed with image reading.

I do not have a strong opinion on this point; I will let other developers decide.
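To illustrate the first point, here is a minimal sketch of the low-level casting mentioned for the `CV_32FXX` modes: reinterpreting the image `data` byte array as 32-bit floats. The byte order is an assumption here (little-endian); check how your source actually encodes the pixels.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Reinterpret an image `data` byte array as an Array[Float].
// Assumes a CV_32F* mode and little-endian encoding.
def bytesToFloats(data: Array[Byte]): Array[Float] = {
  require(data.length % 4 == 0, "byte length must be a multiple of 4")
  val buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer()
  val out = new Array[Float](buf.remaining())
  buf.get(out) // bulk copy from the float view into the output array
  out
}
```

The same pattern works for other numeric modes by swapping `asFloatBuffer` for `asDoubleBuffer`, etc.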
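On the rescaling ambiguity: as a concrete illustration, here is one possible reading of the proposed (hypothetical) `smallSideSize` parameter, in which the shorter side is scaled to a target size while preserving aspect ratio. Interpolation and cropping are deliberately left unspecified, since that is exactly where libraries differ.

```scala
// One possible "smallSideSize" policy (sketch, not part of the PR):
// compute target dimensions so the shorter side equals `smallSide`,
// preserving aspect ratio. Cropping/interpolation are out of scope.
def scaledDims(width: Int, height: Int, smallSide: Int): (Int, Int) = {
  require(width > 0 && height > 0 && smallSide > 0)
  if (width <= height) {
    // Width is the short side: scale height proportionally.
    (smallSide, math.round(height.toDouble * smallSide / width).toInt)
  } else {
    // Height is the short side: scale width proportionally.
    (math.round(width.toDouble * smallSide / height).toInt, smallSide)
  }
}
```

Even this simple helper bakes in choices (rounding mode, no cropping) that another library would make differently, which is why pinning down one definition deserves its own PR.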