Possible SPIP to improve matrix and vector column type support

2018-04-11 Thread Leif Walsh
Hi all,

I’ve been playing around with the Vector and Matrix UDTs in pyspark.ml and
I’ve found myself wanting more.

There is a minor issue in that with the arrow serialization enabled, these
types don’t serialize properly in python UDF calls or in toPandas. There’s
a natural representation for them in numpy.ndarray, and I’ve started a
conversation with the arrow community about supporting tensor-valued
columns, but that might be a ways out. In the meantime, I think we can fix
this by using the FixedSizeBinary column type in arrow, together with some
metadata describing the tensor shape (list of dimension sizes).
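
A minimal pyarrow sketch of what I mean, using a made-up 'shape'/'dtype'
metadata convention (nothing standardized, just to illustrate):

    import numpy as np
    import pyarrow as pa

    # One 2x2 float64 matrix per cell: 4 elements * 8 bytes = 32 bytes.
    mats = [np.arange(4, dtype='float64').reshape(2, 2) for _ in range(3)]
    field = pa.field('feature', pa.binary(32),
                     metadata={b'shape': b'2,2', b'dtype': b'float64'})
    arr = pa.array([m.tobytes() for m in mats], type=pa.binary(32))

    # A reader reconstructs the ndarray from the bytes plus the metadata.
    shape = tuple(int(d) for d in field.metadata[b'shape'].split(b','))
    dtype = field.metadata[b'dtype'].decode()
    first = np.frombuffer(arr[0].as_py(), dtype=dtype).reshape(shape)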

The larger issue, for which I intend to submit an SPIP soon, is that
these types could be better supported at the API layer, regardless of
serialization. In the limit, we could consider the entire numpy ndarray
surface area as a target. At a minimum, I'm thinking these types should
support column operations like matrix multiply, transpose, and inner and
outer products, and maybe have a more ergonomic construction API like
df.withColumn('feature', Vectors.of('list', 'of', 'cols')); the
VectorAssembler API is kind of clunky.
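
To make the construction idea concrete, here's a rough stand-in for the
proposed Vectors.of that you can build on a plain Python UDF today
(vector_of and the x/y/z columns are hypothetical):

    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.sql.functions import col, udf

    # Hypothetical ergonomic constructor: assemble named numeric columns
    # into a single ml Vector column without a VectorAssembler stage.
    def vector_of(*names):
        assemble = udf(lambda *xs: Vectors.dense(list(xs)), VectorUDT())
        return assemble(*[col(n) for n in names])

    df = df.withColumn('feature', vector_of('x', 'y', 'z'))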

One possibility here is to restrict the tensor column types such that
every value must have the same shape, e.g. a 2x2 matrix. This would
allow operations to check validity before execution; for example, a
matrix multiply could check that dimensions match and fail fast.
However, there might be use cases for a column containing variable-shape
tensors; I'm open to discussion here.
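
As a sketch of the fail-fast behavior I have in mind (names made up), a
fixed-shape matrix multiply could validate shapes at analysis time:

    # Check operand shapes before running anything on the cluster.
    def matmul_result_shape(left, right):
        if left[1] != right[0]:
            raise ValueError('cannot multiply %s by %s' % (left, right))
        return (left[0], right[1])

    matmul_result_shape((2, 2), (2, 3))  # -> (2, 3)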

What do you all think?
-- 
Cheers,
Leif


Re: Maintenance releases for SPARK-23852?

2018-04-11 Thread Dongjoon Hyun
Great.

If we can upgrade the Parquet dependency from 1.8.2 to 1.8.3 in Apache
Spark 2.3.1, let's upgrade the ORC dependency from 1.4.1 to 1.4.3 together.

Currently, the patch is only merged into the master branch; 1.4.1 has the
following issue:

https://issues.apache.org/jira/browse/SPARK-23340

Bests,
Dongjoon.



On Wed, Apr 11, 2018 at 1:23 PM, Reynold Xin  wrote:

> Seems like this would make sense... we usually make maintenance releases
> for bug fixes after a month anyway.


Re: Maintenance releases for SPARK-23852?

2018-04-11 Thread Reynold Xin
Seems like this would make sense... we usually make maintenance releases
for bug fixes after a month anyway.




Re: Maintenance releases for SPARK-23852?

2018-04-11 Thread Henry Robinson
On 11 April 2018 at 12:47, Ryan Blue  wrote:

> I think a 1.8.3 Parquet release makes sense for the 2.3.x releases of
> Spark.
>
> To be clear though, this only affects Spark when reading data written by
> Impala, right? Or does Parquet CPP also produce data like this?
>

I don't know about parquet-cpp, but yeah, the only implementation I've
seen writing the half-completed stats is Impala (as you know, that's
compliant with the spec, just an unusual choice).




Re: Maintenance releases for SPARK-23852?

2018-04-11 Thread Ryan Blue
I think a 1.8.3 Parquet release makes sense for the 2.3.x releases of Spark.

To be clear though, this only affects Spark when reading data written by
Impala, right? Or does Parquet CPP also produce data like this?




-- 
Ryan Blue
Software Engineer
Netflix


Maintenance releases for SPARK-23852?

2018-04-11 Thread Henry Robinson
Hi all -

SPARK-23852 (where a query can silently give wrong results thanks to a
predicate pushdown bug in Parquet) is a fairly bad bug. In other projects
I've been involved with, we've released maintenance releases for bugs of
this severity.
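
For anyone affected in the meantime, the usual stopgap (at the cost of
scan performance) is to disable Parquet filter pushdown entirely:

    # Avoids the bad pruning path until a fixed Parquet release ships.
    spark.conf.set('spark.sql.parquet.filterPushdown', 'false')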

Since Spark 2.4.0 is probably a while away, I wanted to see if there was
any consensus over whether we should consider (at least) a 2.3.1.

The reason this particular issue is a bit tricky is that the Parquet
community haven't yet produced a maintenance release that fixes the
underlying bug, but they are in the process of releasing a new minor
version, 1.10, which includes a fix. Having spoken to a couple of Parquet
developers, they'd be willing to consider a maintenance release, but would
probably only bother if we (or another affected project) asked them to.

My guess is that we wouldn't want to upgrade to a new minor version of
Parquet for a Spark maintenance release, so asking for a Parquet
maintenance release makes sense.

What does everyone think?

Best,
Henry


[Structured Streaming] File source, Parquet format: use of the mergeSchema option.

2018-04-11 Thread Gerard Maas
Hi,

I'm looking into the Parquet format support for the File source in
Structured Streaming. The docs mention the use of the 'mergeSchema'
option to merge the schemas of the part-files found [1].

What would be the practical use of that in a streaming context?

In its batch counterpart, `mergeSchema` would infer the schema superset of
the part-files found.
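
For reference, the batch usage looks like this (path made up):

    # Batch mode: infer the superset schema across all part-files.
    df = spark.read.option('mergeSchema', 'true').parquet('/data/events')
    df.printSchema()  # union of the fields seen in every part-file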


When using the File source + Parquet format in streaming mode, we must
provide a schema to the readStream.schema(...) builder, and that schema
is fixed for the duration of the stream.
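
That is, something like this (schema and path made up):

    from pyspark.sql.types import LongType, StringType, StructField, StructType

    schema = StructType([StructField('id', LongType()),
                         StructField('name', StringType())])

    events = (spark.readStream
                   .schema(schema)                 # fixed for the stream's lifetime
                   .option('mergeSchema', 'true')  # the option in question
                   .parquet('/data/events'))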

My current understanding is that:

- Files containing a subset of the fields declared in the schema will
render null values for the missing fields.
- For files containing a superset of the fields, the additional data
fields will be lost.
- Files not matching the schema set on the streaming source will render
all fields null for each record in the file.

Is the 'mergeSchema' option playing another role? From the user's
perspective, they may think that this option would help their job cope
with schema evolution at runtime, but that does not seem to be the case.

What is the use of this option?

-kr, Gerard.


[1]
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala#L376