[
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joseph K. Bradley resolved SPARK-21866.
---------------------------------------
Resolution: Fixed
Fix Version/s: 2.3.0
Issue resolved by pull request 19439
[https://github.com/apache/spark/pull/19439]
> SPIP: Image support in Spark
> ----------------------------
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.2.0
> Reporter: Timothy Hunter
> Labels: SPIP
> Fix For: 2.3.0
>
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use
> cases are emerging for different data formats beyond the traditional SQL
> types or the numerical types (vectors and matrices). Deep Learning
> applications commonly deal with image processing. A number of projects add
> some Deep Learning capabilities to Spark (see list below), but they struggle
> to communicate with each other or with MLlib pipelines because there is no
> standard way to represent an image in Spark DataFrames. We propose to
> federate efforts for representing images in Spark by defining a
> representation that caters to the most common needs of users and library
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames
> and Datasets (based on existing industrial standards), and an interface for
> loading sources of images. It is not meant to be a full-fledged image
> processing library, but rather the core description that other libraries and
> users can rely on. Several packages already offer various processing
> facilities for transforming images or doing more complex operations, and each
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which
> have been testing this design in two open source packages: MMLSpark and Deep
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that
> targets low-level applications. It is significantly more liberal in memory
> usage than compressed image representations such as JPEG, PNG, etc., but it
> allows easy communication with popular image processing libraries and has no
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing
> images, and will gain from a common interchange format (in alphabetical
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance
> integration points with image processing libraries such as libOpenCV, Google
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and
> representations. This SPIP explicitly aims at the most common use case in the
> industry currently: multi-channel matrices of binary, int32, int64, float or
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not
> supported
> * this format is specialized to images and does not attempt to solve the more
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically,
> the type has both "depth" and "number of channels" info: in particular, type
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by
> convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
> If the image failed to load, the value is the empty string "".
> * StructField("origin", StringType(), True),
> ** Some information about the origin of the image. The content of this is
> application-specific.
> ** When the image is loaded from files, users should expect to find the file
> name in this field.
> * StructField("height", IntegerType(), False),
> ** the height of the image, pixels
> ** If the image fails to load, the value is -1.
> * StructField("width", IntegerType(), False),
> ** the width of the image, pixels
> ** If the image fails to load, the value is -1.
> * StructField("nChannels", IntegerType(), False),
> ** The number of channels in this image: it is typically a value of 1 (B&W),
> 3 (RGB), or 4 (BGRA)
> ** If the image fails to load, the value is -1.
> * StructField("data", BinaryType(), False)
> ** packed array content. Due to implementation limitation, it cannot
> currently store more than 2 billions of pixels.
> ** The data is stored in a pixel-by-pixel BGR row-wise order. This follows
> the OpenCV convention.
> ** If the image fails to load, this array is empty.
> For more information about image types, here is an OpenCV guide on types:
> http://docs.opencv.org/2.4/modules/core/doc/intro.html#fixed-pixel-types-limited-use-of-templates
>
> The reference implementation provides some functions to convert popular
> formats (JPEG, PNG, etc.) to the image specification above, and some
> functions to verify if an image is valid.
> h2. Image ingest API
> We propose the following function to load images from a remote distributed
> source as a DataFrame. Here is the signature in Scala. The python interface
> is similar. For compatibility with java, this function should be made
> available through a builder pattern or through the DataSource API. The exact
> mechanics can be discussed during implementation; the goal of the proposal
> below is to propose a specification of the behavior.
> {code}
> def readImages(
> path: String,
> session: SparkSession = null,
> recursive: Boolean = false,
> numPartitions: Int = 0,
> dropImageFailures: Boolean = false,
> // Experimental options
> sampleRatio: Double = 1.0): DataFrame
> {code}
> The type of the returned DataFrame should be the structure type above, with
> the expectation that all the file names be filled.
> Mandatory parameters:
> * *path*: a directory for a file system that contains images
> Optional parameters:
> * *session* (SparkSession, default null): the Spark Session to use to create
> the dataframe. If not provided, it will use the current default Spark session
> via SparkSession.getOrCreate().
> * *recursive* (bool, default false): take the top-level images or look into
> directory recursively
> * *numPartitions* (int, default null): the number of partitions of the final
> dataframe. By default uses the default number of partitions from Spark.
> * *dropImageFailures* (bool, default false): drops the files that failed to
> load. If false (do not drop), some invalid images are kept.
>
> Parameters that are experimental/may be quickly deprecated. These would be
> useful to have but are not critical for a first cut:
> * *sampleRatio* (float, in (0,1), default 1): if less than 1, returns a
> fraction of the data. There is no statistical guarantee about how the
> sampling is performed. This proved to be very helpful for fast prototyping.
> Marked as experimental since it should be pushed to the Spark core.
> The implementation is expected to be in Scala for performance, with a wrapper
> for python.
> This function should be lazy to the extent possible: it should not trigger
> access to the data when called. Ideally, any file system supported by Spark
> should be supported when loading images. There may be restrictions for some
> options such as zip files, etc.
> The reference implementation has also some experimental options (undocumented
> here).
> h2. Reference implementation
> A reference implementation is available as an open-source Spark package in
> this repository (Apache 2.0 license):
> https://github.com/Microsoft/spark-images
> This Spark package will also be published in a binary form on
> spark-packages.org .
> Comments about the API should be addressed in this ticket.
> h2. Optional Rejected Designs
> The use of User-Defined Types was considered. It adds some burden to the
> implementation of various languages and does not provide significant
> advantages.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]