[ https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16263589#comment-16263589 ]
Reynold Xin commented on SPARK-21866:
-------------------------------------

Why not just declare an image function that loads the image data source? The function will throw an exception if one cannot be loaded.

On Thu, Nov 23, 2017 at 7:53 AM Joseph K. Bradley (JIRA) <j...@apache.org>

SPIP: Image support in Spark
----------------------------

                Key: SPARK-21866
                URL: https://issues.apache.org/jira/browse/SPARK-21866
            Project: Spark
         Issue Type: Improvement
         Components: ML
   Affects Versions: 2.2.0
           Reporter: Timothy Hunter
           Assignee: Ilya Matiach
             Labels: SPIP
            Fix For: 2.3.0
        Attachments: SPIP - Image support for Apache Spark V1.1.pdf

h2. Background and motivation

As Apache Spark is used more and more in industry, new use cases are emerging for data formats beyond the traditional SQL types or the numerical types (vectors and matrices). Deep learning applications commonly deal with image processing. A number of projects add deep learning capabilities to Spark (see the list below), but they struggle to communicate with each other or with MLlib pipelines because there is no standard way to represent an image in Spark DataFrames. We propose to federate efforts for representing images in Spark by defining a representation that caters to the most common needs of users and library developers.

This SPIP proposes a specification for representing images in Spark DataFrames and Datasets (based on existing industrial standards), and an interface for loading sources of images. It is not meant to be a full-fledged image processing library, but rather the core description that other libraries and users can rely on. Several packages already offer various processing facilities for transforming images or doing more complex operations, and each has design tradeoffs that make it better as a standalone solution.
This project is a joint collaboration between Microsoft and Databricks, which have been testing this design in two open source packages: MMLSpark and Deep Learning Pipelines.

The proposed image format is an in-memory, decompressed representation that targets low-level applications. It uses significantly more memory than compressed image representations such as JPEG, PNG, etc., but it allows easy communication with popular image processing libraries and has no decoding overhead.

h2. Target users and personas:

Data scientists, data engineers, library developers.

The following libraries define primitives for loading and representing images, and will gain from a common interchange format (in alphabetical order):
* BigDL
* DeepLearning4J
* Deep Learning Pipelines
* MMLSpark
* TensorFlow (Spark connector)
* TensorFlowOnSpark
* TensorFrames
* Thunder

h2. Goals:
* A simple representation of images in Spark DataFrames, based on a pre-existing industrial standard (OpenCV).
* This format should eventually allow the development of high-performance integration points with image processing libraries such as libOpenCV, Google TensorFlow, CNTK, and other C libraries.
* The reader should be able to read popular image formats from distributed sources.

h2. Non-Goals:

Images are a versatile medium and encompass a very wide range of formats and representations. This SPIP explicitly aims at the most common use case in the industry today: multi-channel matrices of binary, int32, int64, float, or double data that fit comfortably in the heap of the JVM:
* the total size of an image should be restricted to less than 2 GB (roughly)
* the meaning of color channels is application-specific and is not mandated by the standard (in line with the OpenCV standard)
* specialized formats used in meteorology, the medical field, etc. are not supported
* this format is specialized to images and does not attempt to solve the more general problem of representing n-dimensional tensors in Spark

h2. Proposed API changes

We propose to add a new package in the package structure, under the MLlib project:

{{org.apache.spark.image}}

h3. Data format

We propose to add the following structure:

imageSchema = StructType([
* StructField("mode", StringType(), False),
** The exact representation of the data.
** The values follow the OpenCV convention described below. The type carries both "depth" and "number of channels" information: for example, type "CV_8UC3" means "3-channel unsigned bytes". BGRA format would be CV_8UC4 (value 24 in the table), with the channel order specified by convention.
** The exact channel ordering and meaning of each channel are dictated by convention. By default, the order is RGB (3 channels) and BGRA (4 channels). If the image failed to load, the value is the empty string "".
* StructField("origin", StringType(), True),
** Some information about the origin of the image. The content is application-specific.
** When the image is loaded from files, users should expect to find the file name in this field.
* StructField("height", IntegerType(), False),
** The height of the image, in pixels.
** If the image fails to load, the value is -1.
* StructField("width", IntegerType(), False),
** The width of the image, in pixels.
** If the image fails to load, the value is -1.
* StructField("nChannels", IntegerType(), False),
** The number of channels in this image: typically 1 (B&W), 3 (RGB), or 4 (BGRA).
** If the image fails to load, the value is -1.
* StructField("data", BinaryType(), False)
** Packed array content. Due to an implementation limitation, it cannot currently store more than 2 billion pixels.
** The data is stored pixel-by-pixel in BGR row-wise order. This follows the OpenCV convention.
** If the image fails to load, this array is empty.

For more information about image types, here is an OpenCV guide on types: http://docs.opencv.org/2.4/modules/core/doc/intro.html#fixed-pixel-types-limited-use-of-templates

The reference implementation provides functions to convert popular formats (JPEG, PNG, etc.) to the image specification above, and functions to verify whether an image is valid.

h2. Image ingest API

We propose the following function to load images from a remote distributed source as a DataFrame. Here is the signature in Scala; the Python interface is similar. For compatibility with Java, this function should be made available through a builder pattern or through the DataSource API. The exact mechanics can be discussed during implementation; the goal of the proposal below is to specify the behavior.

{code}
def readImages(
    path: String,
    session: SparkSession = null,
    recursive: Boolean = false,
    numPartitions: Int = 0,
    dropImageFailures: Boolean = false,
    // Experimental options
    sampleRatio: Double = 1.0): DataFrame
{code}

The schema of the returned DataFrame is the structure type above, with the expectation that all the file names are filled in.

Mandatory parameters:
* *path*: a directory on a file system that contains images

Optional parameters:
* *session* (SparkSession, default null): the SparkSession used to create the DataFrame. If not provided, the current default Spark session is used, via SparkSession.getOrCreate().
* *recursive* (bool, default false): whether to take only the top-level images or to look into directories recursively.
* *numPartitions* (int, default 0): the number of partitions of the final DataFrame. By default, Spark's default number of partitions is used.
* *dropImageFailures* (bool, default false): whether to drop files that failed to load. If false (do not drop), invalid images are kept.
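To make the schema's failure semantics and the packed, row-wise BGR layout of the "data" field concrete, here is a small plain-Python sketch. It is not part of the proposal and does not use Spark or OpenCV; the helper name {{make_image_row}} is hypothetical, and pixels are modeled as nested lists of (R, G, B) tuples.

```python
def make_image_row(origin, pixels):
    """Build a dict mirroring the proposed imageSchema fields.

    `pixels` is a list of rows, each row a list of (R, G, B) tuples,
    or None/empty to simulate a load failure.
    """
    if not pixels:
        # Failure semantics from the spec: empty mode string,
        # -1 for height/width/nChannels, empty data array.
        return {"mode": "", "origin": origin, "height": -1,
                "width": -1, "nChannels": -1, "data": b""}
    height, width = len(pixels), len(pixels[0])
    data = bytearray()
    for row in pixels:            # row-wise ...
        for (r, g, b) in row:     # ... pixel-by-pixel, in BGR order
            data.extend((b, g, r))
    return {"mode": "CV_8UC3", "origin": origin, "height": height,
            "width": width, "nChannels": 3, "data": bytes(data)}

# A single red pixel (255, 0, 0) is packed as B, G, R: b'\x00\x00\xff'
row = make_image_row("file:/tmp/red_dot.png", [[(255, 0, 0)]])
```

Note how an invalid image still produces a well-formed row rather than a null, which is what makes the dropImageFailures=false behavior below possible.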
Parameters that are experimental or may be quickly deprecated. These would be useful to have but are not critical for a first cut:
* *sampleRatio* (float, in (0, 1], default 1.0): if less than 1, returns a fraction of the data. There is no statistical guarantee about how the sampling is performed. This has proved very helpful for fast prototyping. Marked as experimental since it should eventually be pushed into Spark core.

The implementation is expected to be in Scala for performance, with a wrapper for Python.

This function should be lazy to the extent possible: it should not trigger access to the data when called. Ideally, any file system supported by Spark should be supported when loading images. There may be restrictions for some options such as zip files, etc.

The reference implementation also has some experimental options (undocumented here).

h2. Reference implementation

A reference implementation is available as an open-source Spark package in this repository (Apache 2.0 license): https://github.com/Microsoft/spark-images

This Spark package will also be published in binary form on spark-packages.org.

Comments about the API should be addressed in this ticket.

h2. Rejected designs

The use of User-Defined Types was considered. It adds some burden to the implementation in the various languages and does not provide significant advantages.
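To illustrate the intended semantics of the readImages() options above ({{recursive}}, {{sampleRatio}}, {{dropImageFailures}}), here is a local, single-process, standard-library-only Python sketch. It is not part of the proposal: the real implementation is distributed and lazy, and {{list_images}} and {{load_fn}} are hypothetical stand-ins for the actual reader and decoder.

```python
import os
import random

def list_images(path, recursive=False, sample_ratio=1.0,
                drop_failures=False, load_fn=None, seed=None):
    """Local approximation of the readImages() option semantics."""
    if recursive:
        # recursive=True: walk the whole directory tree.
        files = [os.path.join(d, name)
                 for d, _, names in os.walk(path) for name in names]
    else:
        # recursive=False: take only top-level files.
        files = [os.path.join(path, f) for f in sorted(os.listdir(path))
                 if os.path.isfile(os.path.join(path, f))]
    if sample_ratio < 1.0:
        # No statistical guarantee about the sampling, as noted above.
        rng = random.Random(seed)
        files = [f for f in files if rng.random() < sample_ratio]
    rows = []
    for f in files:
        img = load_fn(f) if load_fn else None   # None simulates a decode failure
        if img is None and drop_failures:
            continue        # dropImageFailures=True: skip invalid images
        rows.append((f, img))                   # origin is the file name
    return rows
```

As in the proposal, a failed load is either kept as an "invalid image" row or silently dropped, depending on the flag, rather than aborting the whole read.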