Github user thunterdb commented on the issue: https://github.com/apache/spark/pull/19439

@hhbyyh thank you for bringing up these questions. In response:

> Does the current schema support or plan to support image feature data in Floats[] or Doubles[]?

It does, indirectly: this is what the `CV_32FXX` field types are for. You need to do some low-level casting to convert the byte array to an array of numbers, but that should not be a limitation for most applications.

> Correct me if I'm wrong, but I don't see ocvTypes playing any role in the code. If the field is for future extension, maybe it's better to keep only the supported types, not all the OpenCV types.

Indeed. These fields are added so that users know what values are allowed. A Scala-friendly choice would have been sealed traits or enumerations, but the consensus in Spark has been for a low-level representation. Nothing precludes adding a case class in the future to represent this dataset with more type-safety information.

> In most scenarios, deep learning applications use rescaled/cropped images (typically 256, 224 or smaller). Maybe add an extra parameter "smallSideSize" to the readImages method, which would be more convenient for users and would avoid caching the image at its original size (which could be 100 times larger than the scaled image). This could be done in a follow-up PR.

This is a good point that we have all hit. The issue is that there is no unique definition of rescaling (what interpolation? crop then scale? scale then crop?), and each library has made different choices. This would certainly be good material for a future PR.

> Not sure about the reason to include "origin" info in the image data. Based on my experience, path info serves better as a separate column in the DataFrame (e.g. for prediction).

Yes, this feature has been debated.
Some developers have had a compelling need to access information about the origin of an image directly inside the image struct.

> IMO the parameter "recursive" may not be necessary. Existing wildcard matching can provide more functionality.

Indeed, this feature may not seem that useful at first glance. For some Hadoop file systems, though, in which images are accessed in a batched manner, it is useful to traverse these batches; this matters for performance. This is why it is marked as experimental for the time being.

> For the Scala API, ImageSchema should be in a separate file and not mixed with image reading.

I do not have a strong opinion on this point; I will let other developers decide.
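To illustrate the first point, here is a minimal sketch of the low-level casting mentioned for the `CV_32FXX` modes: reinterpreting the image `data` byte array as 32-bit floats. The byte order is an assumption here (little-endian); check how your source actually encodes the pixels.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Reinterpret an image `data` byte array as an Array[Float].
// Assumes a CV_32F* mode and little-endian encoding.
def bytesToFloats(data: Array[Byte]): Array[Float] = {
  require(data.length % 4 == 0, "byte length must be a multiple of 4")
  val buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer()
  val out = new Array[Float](buf.remaining())
  buf.get(out) // bulk copy from the float view into the output array
  out
}
```

The same pattern works for other numeric modes by swapping `asFloatBuffer` for `asDoubleBuffer`, etc.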
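On the rescaling ambiguity: as a concrete illustration, here is one possible reading of the proposed (hypothetical) `smallSideSize` parameter, in which the shorter side is scaled to a target size while preserving aspect ratio. Interpolation and cropping are deliberately left unspecified, since that is exactly where libraries differ.

```scala
// One possible "smallSideSize" policy (sketch, not part of the PR):
// compute target dimensions so the shorter side equals `smallSide`,
// preserving aspect ratio. Cropping/interpolation are out of scope.
def scaledDims(width: Int, height: Int, smallSide: Int): (Int, Int) = {
  require(width > 0 && height > 0 && smallSide > 0)
  if (width <= height) {
    // Width is the short side: scale height proportionally.
    (smallSide, math.round(height.toDouble * smallSide / width).toInt)
  } else {
    // Height is the short side: scale width proportionally.
    (math.round(width.toDouble * smallSide / height).toInt, smallSide)
  }
}
```

Even this simple helper bakes in choices (rounding mode, no cropping) that another library would make differently, which is why pinning down one definition deserves its own PR.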