[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark

2018-01-27 Thread Denny Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denny Lee updated SPARK-21866:
--
Description: 
h2. Background and motivation

As Apache Spark is being used more and more in the industry, some new use cases 
are emerging for different data formats beyond the traditional SQL types or the 
numerical types (vectors and matrices). Deep Learning applications commonly 
deal with image processing. A number of projects add some Deep Learning 
capabilities to Spark (see list below), but they struggle to communicate with 
each other or with MLlib pipelines because there is no standard way to 
represent an image in Spark DataFrames. We propose to federate efforts for 
representing images in Spark by defining a representation that caters to the 
most common needs of users and library developers.

This SPIP proposes a specification to represent images in Spark DataFrames and 
Datasets (based on existing industrial standards), and an interface for loading 
sources of images. It is not meant to be a full-fledged image processing 
library, but rather the core description that other libraries and users can 
rely on. Several packages already offer various processing facilities for 
transforming images or doing more complex operations, and each has various 
design tradeoffs that make them better as standalone solutions.

This project is a joint collaboration between Microsoft and Databricks, which 
have been testing this design in two open source packages: MMLSpark and Deep 
Learning Pipelines.

The proposed image format is an in-memory, decompressed representation that 
targets low-level applications. It is significantly more liberal in memory 
usage than compressed image representations such as JPEG, PNG, etc., but it 
allows easy communication with popular image processing libraries and has no 
decoding overhead.
h2. Target users and personas:

Data scientists, data engineers, library developers.
The following libraries define primitives for loading and representing images, 
and will gain from a common interchange format (in alphabetical order):
 * BigDL
 * DeepLearning4J
 * Deep Learning Pipelines
 * MMLSpark
 * TensorFlow (Spark connector)
 * TensorFlowOnSpark
 * TensorFrames
 * Thunder

h2. Goals:
 * Simple representation of images in Spark DataFrames, based on pre-existing 
industrial standards (OpenCV)
 * This format should eventually allow the development of high-performance 
integration points with image processing libraries such as libOpenCV, Google 
TensorFlow, CNTK, and other C libraries.
 * The reader should be able to read popular formats of images from distributed 
sources.

h2. Non-Goals:

Images are a versatile medium and encompass a very wide range of formats and 
representations. This SPIP explicitly aims at the most common use case in the 
industry currently: multi-channel matrices of binary, int32, int64, float or 
double data that can fit comfortably in the heap of the JVM:
 * the total size of an image should be restricted to less than 2GB (roughly)
 * the meaning of color channels is application-specific and is not mandated by 
the standard (in line with the OpenCV standard)
 * specialized formats used in meteorology, the medical field, etc. are not 
supported
 * this format is specialized to images and does not attempt to solve the more 
general problem of representing n-dimensional tensors in Spark

h2. Proposed API changes

We propose to add a new package in the package structure, under the MLlib 
project:
 {{org.apache.spark.image}}
h3. Data format

We propose to add the following structure:

imageSchema = StructType([
 * StructField("mode", StringType(), False),
 ** The exact representation of the data.
 ** The values follow the OpenCV convention. Basically, the type encodes both 
"depth" and "number of channels" info: for example, type "CV_8UC3" means 
"3-channel unsigned bytes", and BGRA format would be CV_8UC4 (value 24 in the 
OpenCV type table), with the channel order specified by convention.
 ** The exact channel ordering and meaning of each channel is dictated by 
convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
 ** If the image fails to load, the value is the empty string "".

 * StructField("origin", StringType(), True),
 ** Some information about the origin of the image. The content of this is 
application-specific.
 ** When the image is loaded from files, users should expect to find the file 
name in this field.

 * StructField("height", IntegerType(), False),
 ** the height of the image, pixels
 ** If the image fails to load, the value is -1.

 * StructField("width", IntegerType(), False),
 ** the width of the image, pixels
 ** If the image fails to load, the value is -1.

 * StructField("nChannels", IntegerType(), False),
 ** The number of channels in this image: it is typically a value of 1 (B&W), 3 
(RGB), or 4 (BGRA)
 ** If the image fails to load, the value is -1.
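
The OpenCV "mode" convention referenced above can be sketched in plain Python 
(no Spark required). OpenCV packs a depth code and a channel count into a 
single integer type value; the depth constants and formula below follow 
OpenCV's core headers, and the function name is illustrative, not part of the 
proposed API:

```python
# Sketch of the OpenCV type ("mode") encoding used by the proposed schema.
# OpenCV combines depth and channels as: type = depth + ((channels - 1) << 3)

# OpenCV depth codes (as defined in opencv2/core/hal/interface.h)
CV_8U, CV_8S, CV_16U, CV_16S, CV_32S, CV_32F, CV_64F = range(7)

def cv_make_type(depth: int, channels: int) -> int:
    """Combine an OpenCV depth code and channel count into a type value."""
    return depth + ((channels - 1) << 3)

# "CV_8UC3": 3-channel unsigned bytes (RGB by the default convention)
print(cv_make_type(CV_8U, 3))  # 16
# "CV_8UC4": 4-channel unsigned bytes (BGRA by convention)
print(cv_make_type(CV_8U, 4))  # 24
```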

[jira] [Created] (SPARK-18426) Python Documentation Fix for Structured Streaming Programming Guide

2016-11-13 Thread Denny Lee (JIRA)
Denny Lee created SPARK-18426:
-

 Summary: Python Documentation Fix for Structured Streaming 
Programming Guide
 Key: SPARK-18426
 URL: https://issues.apache.org/jira/browse/SPARK-18426
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 2.0.1
Reporter: Denny Lee
Priority: Minor
 Fix For: 2.0.2


When running the Python example in the Structured Streaming Programming Guide, the following error occurs:
spark = SparkSession\
TypeError: 'Builder' object is not callable

This is fixed by changing .builder() to .builder.
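
A minimal sketch (no Spark required) of why the published snippet fails: 
SparkSession.builder is an attribute holding a builder object, not a method, 
so appending parentheses raises TypeError. The classes below are illustrative 
stand-ins that mimic PySpark's pattern, not PySpark itself:

```python
# Illustrative stand-in for PySpark's SparkSession/Builder pair.
class Builder:
    def appName(self, name):
        self.name = name
        return self  # builder methods chain by returning self

class SparkSession:
    builder = Builder()  # an attribute holding a Builder, not a factory method

# Correct, as the fix suggests: access .builder without parentheses
spark_builder = SparkSession.builder.appName("StructuredNetworkWordCount")
print(spark_builder.name)  # StructuredNetworkWordCount

# Incorrect, as in the guide: calling .builder() raises the reported error
try:
    SparkSession.builder()
except TypeError as err:
    print(type(err).__name__)  # TypeError
```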



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18200) GraphX Invalid initial capacity when running triangleCount

2016-11-01 Thread Denny Lee (JIRA)
Denny Lee created SPARK-18200:
-

 Summary: GraphX Invalid initial capacity when running triangleCount
 Key: SPARK-18200
 URL: https://issues.apache.org/jira/browse/SPARK-18200
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 2.0.1, 2.0.0, 2.0.2
 Environment: Databricks, Ubuntu 16.04, macOS Sierra
Reporter: Denny Lee


Running GraphX triangle count on a large-ish file results in the "Invalid initial 
capacity" error when running on Spark 2.0 (tested on Spark 2.0, 2.0.1, and 
2.0.2).  You can see the results at: http://bit.ly/2eQKWDN

Running the same code on Spark 1.6, the query completes without any 
problems: http://bit.ly/2fATO1M

The GraphFrames version of this code also runs successfully (Spark 2.0, 
GraphFrames 0.2): http://bit.ly/2fAS8W8

Reference Stackoverflow question:
Spark GraphX: requirement failed: Invalid initial capacity 
(http://stackoverflow.com/questions/40337366/spark-graphx-requirement-failed-invalid-initial-capacity)






[jira] [Commented] (SPARK-12036) No applicable constructor/method when calling collect on a Dataset

2015-11-29 Thread Denny Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15031121#comment-15031121
 ] 

Denny Lee commented on SPARK-12036:
---

I just tested it out with the patch and it looks like this resolves it.  Thanks!

> No applicable constructor/method when calling collect on a Dataset
> --
>
> Key: SPARK-12036
> URL: https://issues.apache.org/jira/browse/SPARK-12036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> In Spark shell, I tried 
> {code}
> case class Person(name: String, age: Int)
> val dataframe = 
> sqlContext.read.json("/Users/yhuai/Projects/Spark/yin-spark-2/examples/src/main/resources/people.json")
> val ds = dataframe.as[Person]
> ds.collect
> {code}
> Then, I got
> {code}
> 15/11/28 10:40:51 ERROR GenerateSafeProjection: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 46, Column 67: No applicable constructor/method found for actual parameters 
> "java.lang.String, long"; candidates are: 
> "$line15.$read$$iwC$$iwC$Person(java.lang.String, int)"
> /* 001 */ 
> /* 002 */ public java.lang.Object 
> generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
> /* 003 */   return new SpecificSafeProjection(expr);
> /* 004 */ }
> /* 005 */ 
> /* 006 */ class SpecificSafeProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
> /* 007 */   
> /* 008 */   private org.apache.spark.sql.catalyst.expressions.Expression[] 
> expressions;
> /* 009 */   private org.apache.spark.sql.catalyst.expressions.MutableRow 
> mutableRow;
> /* 010 */   
> /* 011 */   
> /* 012 */   
> /* 013 */   public 
> SpecificSafeProjection(org.apache.spark.sql.catalyst.expressions.Expression[] 
> expr) {
> /* 014 */ expressions = expr;
> /* 015 */ mutableRow = new 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow(1);
> /* 016 */ 
> /* 017 */   }
> /* 018 */   
> /* 019 */   public java.lang.Object apply(java.lang.Object _i) {
> /* 020 */ InternalRow i = (InternalRow) _i;
> /* 021 */ /* newinstance(class 
> $line15.$read$$iwC$$iwC$Person,invoke(input[1, 
> StringType],toString,ObjectType(class java.lang.String)),input[0, 
> LongType],false,ObjectType(class 
> $line15.$read$$iwC$$iwC$Person),Some($line15.$read$$iwC$$iwC@62303b81)) */
> /* 022 */ /* invoke(input[1, StringType],toString,ObjectType(class 
> java.lang.String)) */
> /* 023 */ /* input[1, StringType] */
> /* 024 */ boolean isNull4 = i.isNullAt(1);
> /* 025 */ UTF8String primitive5 = isNull4 ? null : (i.getUTF8String(1));
> /* 026 */ 
> /* 027 */ 
> /* 028 */ boolean isNull2 = primitive5 == null;
> /* 029 */ java.lang.String primitive3 =
> /* 030 */ isNull2 ?
> /* 031 */ null : (java.lang.String) primitive5.toString();
> /* 032 */ isNull2 = primitive3 == null;
> /* 033 */ /* input[0, LongType] */
> /* 034 */ boolean isNull6 = i.isNullAt(0);
> /* 035 */ long primitive7 = isNull6 ? -1L : (i.getLong(0));
> /* 036 */ /* $line15.$read$$iwC$$iwC@62303b81 */
> /* 037 */ /* expression: $line15.$read$$iwC$$iwC@62303b81 */
> /* 038 */ java.lang.Object obj10 = expressions[0].eval(i);
> /* 039 */ boolean isNull8 = obj10 == null;
> /* 040 */ $line15.$read$$iwC$$iwC primitive9 = null;
> /* 041 */ if (!isNull8) {
> /* 042 */   primitive9 = ($line15.$read$$iwC$$iwC) obj10;
> /* 043 */ }
> /* 044 */ 
> /* 045 */ 
> /* 046 */ $line15.$read$$iwC$$iwC$Person primitive1 = primitive9.new 
> Person(primitive3, primitive7);
> /* 047 */ final boolean isNull0 = primitive1 == null;
> /* 048 */ if (isNull0) {
> /* 049 */   mutableRow.setNullAt(0);
> /* 050 */ } else {
> /* 051 */   
> /* 052 */   mutableRow.update(0, primitive1);
> /* 053 */ }
> /* 054 */ 
> /* 055 */ return mutableRow;
> /* 056 */   }
> /* 057 */ }
> /* 058 */ 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 46, Column 67: No applicable constructor/method found for actual parameters 
> "java.lang.String, long"; candidates are: 
> "$line15.$read$$iwC$$iwC$Person(java.lang.String, int)"
>   at 
> org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174)
>   at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:7559)
>   at 
> org.codehaus.janino.UnitCompiler.invokeConstructor(UnitCompiler.java:6505)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4126)
>   at org.codehaus.janino.UnitCompiler.access$7600(UnitCompiler.java:185)
>   at 
> org.codehaus.janino.UnitCompiler$10.visitNewClassInstance(UnitCompiler.java:3275)
>   at org.codehaus.ja