Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark

Matei Zaharia Sat, 23 Sep 2017 10:22:48 -0700

+1; we should consider something similar for multi-dimensional tensors too.


Matei

> On Sep 23, 2017, at 7:27 AM, Yanbo Liang <[email protected]> wrote:
> 
> +1
> 
> On Sat, Sep 23, 2017 at 7:08 PM, Noman Khan <[email protected]> wrote:
> +1 
> 
> Regards 
> Noman 
> From: Denny Lee <[email protected]>
> Sent: Friday, September 22, 2017 2:59:33 AM
> To: Apache Spark Dev; Sean Owen; Tim Hunter
> Cc: Danil Kirsanov; Joseph Bradley; Reynold Xin; Sudarshan Sudarshan
> Subject: Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark
>  
> +1 
> 
> On Thu, Sep 21, 2017 at 11:15 Sean Owen <[email protected]> wrote:
> Am I right that this doesn't mean other packages would use this 
> representation, but that they could?
> 
> The representation looked fine to me w.r.t. what DL frameworks need.
> 
> My previous comment was that this is actually quite lightweight. It's kind of 
> like how I/O support is provided for CSV and JSON, so makes enough sense to 
> add to Spark. It doesn't really preclude other solutions.
> 
> For those reasons I think it's fine. +1
> 
> On Thu, Sep 21, 2017 at 6:32 PM Tim Hunter <[email protected]> wrote:
> Hello community,
> 
> I would like to call for a vote on SPARK-21866. It is a short proposal that 
> has important applications for image processing and deep learning. Joseph 
> Bradley has offered to be the shepherd.
> 
> JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
> PDF version: 
> https://issues.apache.org/jira/secure/attachment/12884792/SPIP%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf
> 
> Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
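
To make the memory trade-off above concrete, here is a rough sketch of the arithmetic for a decompressed buffer. The helper name and the example dimensions are illustrative assumptions and are not part of the proposal.

```python
def decompressed_size_bytes(height, width, n_channels, bytes_per_value=1):
    """Size in bytes of a packed, decompressed pixel buffer (hypothetical helper)."""
    return height * width * n_channels * bytes_per_value


# Example: a 1920x1080 image with 3 unsigned-byte channels (CV_8UC3).
size = decompressed_size_bytes(height=1080, width=1920, n_channels=3)
print(size)  # 6220800 bytes, roughly 5.9 MiB
```

A compressed JPEG or PNG of the same picture is usually much smaller, which is the cost the proposal accepts in exchange for having no decoding step.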
> Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
>       • BigDL
>       • DeepLearning4J
>       • Deep Learning Pipelines
>       • MMLSpark
>       • TensorFlow (Spark connector)
>       • TensorFlowOnSpark
>       • TensorFrames
>       • Thunder
> Goals:
>       • Simple representation of images in Spark DataFrames, based on 
> pre-existing industrial standards (OpenCV)
>       • This format should eventually allow the development of 
> high-performance integration points with image processing libraries such as 
> libOpenCV, Google TensorFlow, CNTK, and other C libraries.
>       • The reader should be able to read popular formats of images from 
> distributed sources.
> Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
>       • the total size of an image should be restricted to less than 2GB 
> (roughly)
>       • the meaning of color channels is application-specific and is not 
> mandated by the standard (in line with the OpenCV standard)
>       • specialized formats used in meteorology, the medical field, etc. are 
> not supported
>       • this format is specialized to images and does not attempt to solve 
> the more general problem of representing n-dimensional tensors in Spark
> Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> org.apache.spark.image
> Data format
> We propose to add the following structure:
> imageSchema = StructType([
>       • StructField("mode", StringType(), False),
>               • The exact representation of the data.
>               • The values are described in the following OpenCV convention. 
> Basically, the type has both "depth" and "number of channels" info: in 
> particular, type "CV_8UC3" means "3 channel unsigned bytes". BGRA format 
> would be CV_8UC4 (value 24 in the table) with the channel order specified by 
> convention.
>               • The exact channel ordering and meaning of each channel is 
> dictated by convention. By default, the order is BGR (3 channels) and BGRA (4 
> channels).
>               • If the image fails to load, the value is the empty string "".
>       • StructField("origin", StringType(), True),
>               • Some information about the origin of the image. The content 
> of this is application-specific.
>               • When the image is loaded from files, users should expect to 
> find the file name in this field.
>       • StructField("height", IntegerType(), False),
>               • the height of the image, in pixels
>               • If the image fails to load, the value is -1.
>       • StructField("width", IntegerType(), False),
>               • the width of the image, in pixels
>               • If the image fails to load, the value is -1.
>       • StructField("nChannels", IntegerType(), False),
>               • The number of channels in this image: it is typically a value 
> of 1 (B&W), 3 (RGB), or 4 (BGRA)
>               • If the image fails to load, the value is -1.
>       • StructField("data", BinaryType(), False)
>               • packed array content. Due to an implementation limitation, it 
> cannot currently store more than 2 billion pixels.
>               • The data is stored in a pixel-by-pixel BGR row-wise order. 
> This follows the OpenCV convention.
>               • If the image fails to load, this array is empty.
> For more information about image types, here is an OpenCV guide on types: 
> http://docs.opencv.org/2.4/modules/core/doc/intro.html#fixed-pixel-types-limited-use-of-templates
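
For reference, OpenCV packs the depth and the number of channels into a single integer type code. The sketch below reproduces that arithmetic in Python. The helper name is hypothetical, and only the unsigned 8-bit depth is shown.

```python
CV_8U = 0  # OpenCV depth code for unsigned 8-bit values


def cv_type_code(depth, n_channels):
    """Mirror OpenCV's CV_MAKETYPE macro: depth + ((channels - 1) << 3)."""
    if n_channels < 1:
        raise ValueError("an image needs at least one channel")
    return depth + ((n_channels - 1) << 3)


print(cv_type_code(CV_8U, 1))  # CV_8UC1 -> 0
print(cv_type_code(CV_8U, 3))  # CV_8UC3 -> 16
print(cv_type_code(CV_8U, 4))  # CV_8UC4 -> 24
```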
> The reference implementation provides some functions to convert popular 
> formats (JPEG, PNG, etc.) to the image specification above, and some 
> functions to verify if an image is valid.
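
As an illustration of the structure above, here is a minimal plain-Python sketch of one image row, the failed-load convention, a validity check, and pixel lookup in the packed array. It is not the reference implementation: the names ImageRow, failed_image, is_valid and pixel are hypothetical, and the sketch assumes one byte per channel value (for example CV_8UC3).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImageRow:
    """Plain-Python mirror of the proposed image fields (hypothetical class)."""
    mode: str               # OpenCV type name, "" if loading failed
    origin: Optional[str]   # usually the file name; the only nullable field
    height: int             # in pixels, -1 if loading failed
    width: int              # in pixels, -1 if loading failed
    nChannels: int          # -1 if loading failed
    data: bytes             # packed pixels, empty if loading failed


def failed_image(origin):
    """Row that follows the failed-load convention described above."""
    return ImageRow("", origin, -1, -1, -1, b"")


def is_valid(img):
    """Check field consistency, assuming one byte per channel value."""
    if img.mode == "" or min(img.height, img.width, img.nChannels) < 1:
        return False
    return len(img.data) == img.height * img.width * img.nChannels


def pixel(img, row, col):
    """Channel values of one pixel in a row-wise, pixel-by-pixel packed array."""
    start = (row * img.width + col) * img.nChannels
    return tuple(img.data[start:start + img.nChannels])


# A 1x2 image with 3 channels; the second pixel is pure red in BGR order.
img = ImageRow("CV_8UC3", "file:/tmp/example.png", 1, 2, 3,
               bytes([255, 0, 0, 0, 0, 255]))
print(is_valid(img))                                 # True
print(pixel(img, 0, 1))                              # (0, 0, 255)
print(is_valid(failed_image("file:/tmp/bad.png")))   # False
```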
> Image ingest API
> We propose the following function to load images from a remote distributed 
> source as a DataFrame. Here is the signature in Scala. The Python interface 
> is similar. For compatibility with Java, this function should be made 
> available through a builder pattern or through the DataSource API. The exact 
> mechanics can be discussed during implementation; the goal of the proposal 
> below is to propose a specification of the behavior and of the options:
> def readImages(
>     path: String,
>     session: SparkSession = null,
>     recursive: Boolean = false,
>     numPartitions: Int = 0,
>     dropImageFailures: Boolean = false,
>     // Experimental options
>     sampleRatio: Double = 1.0): DataFrame
> 
> The type of the returned DataFrame should be the structure type above, with 
> the expectation that all the file names be filled.
> Mandatory parameters:
>       • path: a directory on a file system that contains images
> Optional parameters:
>       • session (SparkSession, default null): the Spark Session to use to 
> create the dataframe. If not provided, it will use the current default Spark 
> session via SparkSession.getOrCreate().
>       • recursive (bool, default false): if false, take only the top-level 
> images; if true, look into subdirectories recursively
>       • numPartitions (int, default 0): the number of partitions of the 
> final dataframe. With the default value, Spark's default number of 
> partitions is used.
>       • dropImageFailures (bool, default false): drops the files that failed 
> to load. If false (do not drop), some invalid images are kept.
> Parameters that are experimental/may be quickly deprecated. These would be 
> useful to have but are not critical for a first cut:
>       • sampleRatio (float, in (0,1], default 1): if less than 1, returns a 
> fraction of the data. There is no statistical guarantee about how the 
> sampling is performed. This proved to be very helpful for fast prototyping. 
> Marked as experimental since it should be pushed to the Spark core.
> The implementation is expected to be in Scala for performance, with a wrapper 
> for Python.
> This function should be lazy to the extent possible: it should not trigger 
> access to the data when called. Ideally, any file system supported by Spark 
> should be supported when loading images. There may be restrictions for some 
> options such as zip files, etc.
> The reference implementation also has some experimental options (undocumented 
> here).
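
The behavior of the options above can be sketched with a small local-file-system stand-in. This is not the proposed Spark function: read_images, list_files, failed_row, the decode callback and the seed parameter are all hypothetical, and a Python generator is used only to mimic the "lazy" requirement (no file is touched until the result is iterated).

```python
import os
import random


def failed_row(origin):
    """Row for an image that failed to load, per the conventions above."""
    return {"mode": "", "origin": origin, "height": -1, "width": -1,
            "nChannels": -1, "data": b""}


def list_files(path, recursive=False):
    """Yield file paths directly under path, or in the whole tree if recursive."""
    if recursive:
        for root, _dirs, files in os.walk(path):
            for name in sorted(files):
                yield os.path.join(root, name)
    else:
        for name in sorted(os.listdir(path)):
            full = os.path.join(path, name)
            if os.path.isfile(full):
                yield full


def read_images(path, decode, recursive=False, drop_image_failures=False,
                sample_ratio=1.0, seed=None):
    """Lazily yield one row per file.

    decode(file_name, raw_bytes) is a caller-supplied function (an assumption
    of this sketch) that returns a row dict, or None if decoding fails.
    """
    rng = random.Random(seed)
    for file_name in list_files(path, recursive):
        # Sampling with no statistical guarantee, as in the proposal.
        if sample_ratio < 1.0 and rng.random() >= sample_ratio:
            continue
        try:
            with open(file_name, "rb") as handle:
                row = decode(file_name, handle.read())
        except Exception:
            row = None
        if row is None:
            if drop_image_failures:
                continue
            row = failed_row(file_name)
        yield row
```

With drop_image_failures left at False, a file that cannot be decoded still produces a row, with its file name in "origin" and the sentinel values in the other fields.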
> Reference implementation
> A reference implementation is available as an open-source Spark package in 
> this repository (Apache 2.0 license):
> https://github.com/Microsoft/spark-images
> This Spark package will also be published in a binary form on 
> spark-packages.org .
> Comments about the API should be addressed in this ticket.
> Optional Rejected Designs
> The use of User-Defined Types was considered. It adds some burden to the 
> implementation of various languages and does not provide significant 
> advantages.
> 


---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]
