Re: Helper methods for PySpark discussion

2018-10-27 Thread Leif Walsh
In the case of len, I think we should examine how python does iterators and
generators. https://docs.python.org/3/library/collections.abc.html

Iterators have __iter__ and __next__ but are not Sized so they don’t have
__len__. If you ask for the len() of a generator (like len(x for x in
range(10) if x % 2 == 0)) you get a reasonable error message and might
respond by calling len(list(g)) if you know you can afford to materialize
g’s contents. Of course, with a DataFrame materializing all the columns for
all rows back on the python side is way more expensive than df.count(), so
we don’t want to ever steer people to call len(list(df)), but I think
len(df) having an expensive side effect would definitely surprise me.
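
To make that concrete, here is what plain Python does today (a standard
CPython session, no Spark involved):

    >>> g = (x for x in range(10) if x % 2 == 0)
    >>> len(g)
    Traceback (most recent call last):
      ...
    TypeError: object of type 'generator' has no len()
    >>> len(list(g))  # explicit materialization, if you know you can afford it
    5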

Perhaps we can consider the abstract base classes that DataFrames and RDDs
should implement. I actually think it’s not many of them: we don’t have
random access, sizes, or even a cheap way to do set membership.

For the case of len(), I think the best option is to show an error message
that tells you to call count instead.
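
A minimal sketch of what I have in mind (hypothetical, not current PySpark
behaviour; the class stub and message are illustrative only):

    class DataFrame(object):
        # ... existing methods elided ...
        def __len__(self):
            # Refuse rather than silently running a potentially expensive job.
            raise TypeError(
                "len() is not supported on a Spark DataFrame because it would "
                "force a full evaluation; use DataFrame.count() instead")
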
On Fri, Oct 26, 2018 at 21:06 Holden Karau  wrote:

> Ok, so let's say you made a Spark DataFrame and call len() on it -- what do
> you expect to happen?
>
> Personally, I expect Spark to evaluate the DataFrame; this is what happens
> with collections and even iterables.
>
> The interplay with cache is a bit strange, but presumably if you've marked
> your DataFrame for caching you want to cache it (we don't automatically
> mark DataFrames for caching outside of some cases inside ML pipelines where
> this would not apply).
>
> On Fri, Oct 26, 2018, 10:56 AM Li Jin 
>> > (2) If the method forces evaluation and this matches the most obvious way
>> it would be implemented, then we should add it with a note in the docstring
>>
>> I am not sure about this because forcing evaluation could have side
>> effects. For example, df.count() can realize a cache, and if we
>> implement __len__ to call df.count() then len(df) would end up populating
>> some cache, which can be unintuitive.
>>
>> On Fri, Oct 26, 2018 at 1:21 PM Leif Walsh  wrote:
>>
>>> That all sounds reasonable, but I think in the case of 4, and maybe also 3,
>>> I would rather see it implemented to raise an error whose message explains
>>> what’s going on and suggests the explicit operation that would do the closest
>>> equivalent thing. And perhaps raise a warning (using the warnings module)
>>> for things that might be unintuitively expensive.
>>> On Fri, Oct 26, 2018 at 12:15 Holden Karau  wrote:
>>>
>>>> Coming out of https://github.com/apache/spark/pull/21654 it was agreed
>>>> the helper methods in question made sense but there was some desire for a
>>>> plan as to which helper methods we should use.
>>>>
>>>> I'd like to propose a lightweight solution to start with for helper
>>>> methods that match either Pandas or general Python collection helper
>>>> methods:
>>>> 1) If the helper method doesn't collect the DataFrame back to the driver or
>>>> force evaluation, then we should add it without discussion
>>>> 2) If the method forces evaluation and this matches the most obvious way it
>>>> would be implemented, then we should add it with a note in the docstring
>>>> 3) If the method does collect the DataFrame back to the driver and that
>>>> is the most obvious way it would be implemented (e.g. calling list to get
>>>> back a list would have to collect the DataFrame), then we should add it
>>>> with a warning in the docstring
>>>> 4) If the method collects the DataFrame but a reasonable Python
>>>> developer wouldn't expect that behaviour, then not implementing the helper
>>>> method would be better
>>>>
>>>> What do folks think?
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>> --
>>> --
>>> Cheers,
>>> Leif
>>>
>> --
-- 
Cheers,
Leif


Re: Helper methods for PySpark discussion

2018-10-26 Thread Leif Walsh
That all sounds reasonable, but I think in the case of 4, and maybe also 3, I
would rather see it implemented to raise an error whose message explains
what’s going on and suggests the explicit operation that would do the closest
equivalent thing. And perhaps raise a warning (using the warnings module)
for things that might be unintuitively expensive.
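
As a rough sketch of the warning idea (the helper name and wrapper here are
hypothetical, purely for illustration):

    import warnings

    def to_local_list(df):
        """Hypothetical helper; the name and wrapper are illustrative only."""
        warnings.warn(
            "to_local_list() collects the entire DataFrame to the driver, which "
            "may be very expensive; consider an explicit df.collect() or "
            "df.toLocalIterator() if that is really what you want.",
            UserWarning)
        return list(df.collect())
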
On Fri, Oct 26, 2018 at 12:15 Holden Karau  wrote:

> Coming out of https://github.com/apache/spark/pull/21654 it was agreed
> the helper methods in question made sense but there was some desire for a
> plan as to which helper methods we should use.
>
> I'd like to propose a lightweight solution to start with for helper
> methods that match either Pandas or general Python collection helper
> methods:
> 1) If the helper method doesn't collect the DataFrame back to the driver or
> force evaluation, then we should add it without discussion
> 2) If the method forces evaluation and this matches the most obvious way it
> would be implemented, then we should add it with a note in the docstring
> 3) If the method does collect the DataFrame back to the driver and that is
> the most obvious way it would be implemented (e.g. calling list to get back a
> list would have to collect the DataFrame), then we should add it with a
> warning in the docstring
> 4) If the method collects the DataFrame but a reasonable Python developer
> wouldn't expect that behaviour, then not implementing the helper method would
> be better
>
> What do folks think?
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
-- 
-- 
Cheers,
Leif


Re: Python friendly API for Spark 3.0

2018-09-17 Thread Leif Walsh
I agree with Reynold: at some point you’re going to run into the parts of
the pandas API that aren’t distributable. More feature parity will be good,
but users are still eventually going to hit a feature cliff. Moreover, it’s
not just the pandas API that people want to use, but also the set of
libraries built around the pandas DataFrame structure.

I think rather than similarity to pandas, we should target smoother
interoperability with pandas, to ease the pain of hitting this cliff.

We’ve been working on part of this problem with the pandas UDF stuff, but
there’s a lot more to do.
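
As one concrete example of the interoperability we already have, a scalar
pandas UDF lets plain pandas code run per batch (the DataFrame df and its
temp_c column are assumed here just for illustration):

    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf("double", PandasUDFType.SCALAR)
    def fahrenheit(celsius):
        # celsius arrives as a pandas.Series; ordinary pandas code runs per batch
        return celsius * 9.0 / 5.0 + 32.0

    df.withColumn("temp_f", fahrenheit(df["temp_c"]))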

On Sun, Sep 16, 2018 at 17:13 Reynold Xin  wrote:

> Most of those are pretty difficult to add though, because they are
> fundamentally difficult to do in a distributed setting and with lazy
> execution.
>
> We should add some but at some point there are fundamental differences
> between the underlying execution engines that are pretty difficult to
> reconcile.
>
> On Sun, Sep 16, 2018 at 2:09 PM Matei Zaharia 
> wrote:
>
>> My 2 cents on this is that the biggest room for improvement in Python is
>> similarity to Pandas. We already made the Python DataFrame API different
>> from Scala/Java in some respects, but if there’s anything we can do to make
>> it more obvious to Pandas users, that will help the most. The other issue
>> though is that a bunch of Pandas functions are just missing in Spark — it
>> would be awesome to set up an umbrella JIRA to just track those and let
>> people fill them in.
>>
>> Matei
>>
>> > On Sep 16, 2018, at 1:02 PM, Mark Hamstra 
>> wrote:
>> >
>> > It's not splitting hairs, Erik. It's actually very close to something
>> that I think deserves some discussion (perhaps on a separate thread.) What
>> I've been thinking about also concerns API "friendliness" or style. The
>> original RDD API was very intentionally modeled on the Scala parallel
>> collections API. That made it quite friendly for some Scala programmers,
>> but not as much so for users of the other language APIs when they
>> eventually came about. Similarly, the Dataframe API drew a lot from pandas
>> and R, so it is relatively friendly for those used to those abstractions.
>> Of course, the Spark SQL API is modeled closely on HiveQL and standard SQL.
>> The new barrier scheduling draws inspiration from MPI. With all of these
>> models and sources of inspiration, as well as multiple language targets,
>> there isn't really a strong sense of coherence across Spark -- I mean, even
>> though one of the key advantages of Spark is the ability to do within a
>> single framework things that would otherwise require multiple frameworks,
>> actually doing that requires more than one programming style and more design
>> abstractions than are strictly necessary, even when writing Spark code in
>> just a single language.
>> >
>> > For me, that raises questions over whether we want to start designing,
>> implementing and supporting APIs that are designed to be more consistent,
>> friendly and idiomatic to particular languages and abstractions -- e.g. an
>> API covering all of Spark that is designed to look and feel as much like
>> "normal" code for a Python programmer, another that looks and feels more
>> like "normal" Java code, another for Scala, etc. That's a lot more work and
>> support burden than the current approach where sometimes it feels like you
>> are writing "normal" code for your preferred programming environment, and
>> sometimes it feels like you are trying to interface with something foreign,
>> but underneath it hopefully isn't too hard for those writing the
>> implementation code below the APIs, and it is not too hard to maintain
>> multiple language bindings that are each fairly lightweight.
>> >
>> > It's a cost-benefit judgement, of course, whether APIs that are heavier
>> (in terms of implementing and maintaining) and friendlier (for end users)
>> are worth doing, and maybe some of these "friendlier" APIs can be done
>> outside of Spark itself (imo, Frameless is doing a very nice job for the
>> parts of Spark that it is currently covering --
>> https://github.com/typelevel/frameless); but what we have currently is a
>> bit too ad hoc and fragmentary for my taste.
>> >
>> > On Sat, Sep 15, 2018 at 10:33 AM Erik Erlandson 
>> wrote:
>> > I am probably splitting hairs too finely, but I was considering the
>> difference between improvements to the jvm-side (py4j and the scala/java
>> code) that would make it easier to write the python layer ("python-friendly
>> api"), and actual improvements to the python layers ("friendly python api").
>> >
>> > They're not mutually exclusive of course, and both worth working on.
>> But it's *possible* to improve either without the other.
>> >
>> > Stub files look like a great solution for type annotations, maybe even
>> if only python 3 is supported.
>> >
>> > I definitely agree that any decision to drop python 2 should not be
>> taken lightly. Anecdotally, I'm seeing an increase in 

Re: Python friendly API for Spark 3.0

2018-09-15 Thread Leif Walsh
Hey there,

Here’s something I proposed recently that’s in this space.
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-24258

It’s motivated by working with a user who wanted to do some custom
statistics for which they could write the numpy code, and knew in what
dimensions they could parallelize it, but in actually getting it running,
the type system really got in the way.
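
To illustrate the kind of friction I mean (not their actual code; the df with
group and value columns is assumed), even a simple grouped numpy computation
requires spelling out the full output schema up front:

    import numpy as np
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # The full output schema has to be declared before any data is seen.
    @pandas_udf("group long, value double", PandasUDFType.GROUPED_MAP)
    def demean(pdf):
        # pdf is a pandas.DataFrame holding one group; plain numpy from here on
        return pdf.assign(value=pdf["value"].values - np.mean(pdf["value"].values))

    result = df.groupby("group").apply(demean)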


On Fri, Sep 14, 2018 at 15:15 Holden Karau  wrote:

> Since we're talking about Spark 3.0 in the near future (and since some
> recent conversation on a proposed change reminded me) I wanted to open up
> the floor and see if folks have any ideas on how we could make a more
> Python friendly API for 3.0? I'm planning on taking some time to look at
> other systems in the solution space and see what we might want to learn
> from them but I'd love to hear what other folks are thinking too.
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
-- 
-- 
Cheers,
Leif


Re: Revisiting Online serving of Spark models?

2018-05-22 Thread Leif Walsh
I’m with you on json being more readable than parquet, but we’ve had
success using pyarrow’s parquet reader and have been quite happy with it so
far. If your target is python (and probably if not now, then soon, R), you
should look into it.
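
A rough sketch of what that looks like for a saved Spark ML model, assuming my
understanding of the persistence layout (metadata/ as JSON text, data/ as
Parquet); the path is illustrative:

    import json
    import pyarrow.parquet as pq

    path = "/models/lr"  # wherever model.save(path) wrote to
    with open(path + "/metadata/part-00000") as f:
        metadata = json.loads(f.readline())          # params as plain JSON
    coefficients = pq.read_table(path + "/data").to_pandas()  # no SparkContext needed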

On Mon, May 21, 2018 at 16:52 Joseph Bradley  wrote:

> Regarding model reading and writing, I'll give quick thoughts here:
> * Our approach was to use the same format but write JSON instead of
> Parquet.  It's easier to parse JSON without Spark, and using the same
> format simplifies architecture.  Plus, some people want to check files into
> version control, and JSON is nice for that.
> * The reader/writer APIs could be extended to take format parameters (just
> like DataFrame reader/writers) to handle JSON (and maybe, eventually,
> handle Parquet in the online serving setting).
>
> This would be a big project, so proposing a SPIP might be best.  If people
> are around at the Spark Summit, that could be a good time to meet up & then
> post notes back to the dev list.
>
> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung 
> wrote:
>
>> Specifically I’d like bring part of the discussion to Model and
>> PipelineModel, and various ModelReader and SharedReadWrite implementations
>> that rely on SparkContext. This is a big blocker on reusing  trained models
>> outside of Spark for online serving.
>>
>> What’s the next step? Would folks be interested in getting together to
>> discuss/get some feedback?
>>
>>
>> _
>> From: Felix Cheung 
>> Sent: Thursday, May 10, 2018 10:10 AM
>> Subject: Re: Revisiting Online serving of Spark models?
>> To: Holden Karau , Joseph Bradley <
>> jos...@databricks.com>
>> Cc: dev 
>>
>>
>>
>> Huge +1 on this!
>>
>> --
>> *From:* holden.ka...@gmail.com  on behalf of
>> Holden Karau 
>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>> *To:* Joseph Bradley
>> *Cc:* dev
>> *Subject:* Re: Revisiting Online serving of Spark models?
>>
>>
>>
>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley 
>> wrote:
>>
>>> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>>>
>>> Awesome! I'm glad other folks think something like this belongs in Spark.
>>
>>> This was one of the original goals for mllib-local: to have local
>>> versions of MLlib models which could be deployed without the big Spark JARs
>>> and without a SparkContext or SparkSession.  There are related commercial
>>> offerings like this : ) but the overhead of maintaining those offerings is
>>> pretty high.  Building good APIs within MLlib to avoid copying logic across
>>> libraries will be well worth it.
>>>
>>> We've talked about this need at Databricks and have also been syncing
>>> with the creators of MLeap.  It'd be great to get this functionality into
>>> Spark itself.  Some thoughts:
>>> * It'd be valuable to have this go beyond adding transform() methods
>>> taking a Row to the current Models.  Instead, it would be ideal to have
>>> local, lightweight versions of models in mllib-local, outside of the main
>>> mllib package (for easier deployment with smaller & fewer dependencies).
>>> * Supporting Pipelines is important.  For this, it would be ideal to
>>> utilize elements of Spark SQL, particularly Rows and Types, which could be
>>> moved into a local sql package.
>>> * This architecture may require some awkward APIs currently to have
>>> model prediction logic in mllib-local, local model classes in mllib-local,
>>> and regular (DataFrame-friendly) model classes in mllib.  We might find it
>>> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
>>> architecture while making it feasible for 3rd party developers to extend
>>> MLlib APIs (especially in Java).
>>>
>> I agree this could be interesting, and feed into the other discussion
>> around when (or if) we should be considering Spark 3.0
>> I _think_ we could probably do it with optional traits people could mix
>> in to avoid breaking the current APIs but I could be wrong on that point.
>>
>>> * It could also be worth discussing local DataFrames.  They might not be
>>> as important as per-Row transformations, but they would be helpful for
>>> batching for higher throughput.
>>>
>> That could be interesting as well.
>>
>>>
>>> I'll be interested to hear others' thoughts too!
>>>
>>> Joseph
>>>
>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau 
>>> wrote:
>>>
 Hi y'all,

 With the renewed interest in ML in Apache Spark, now seems like as good a
 time as any to revisit the online serving situation in Spark ML. DB &
 others have done some excellent work moving a lot of the necessary
 tools into a local linear algebra package that doesn't depend on having a
 SparkContext.

 There are a few different commercial and 

Re: Possible SPIP to improve matrix and vector column type support

2018-05-12 Thread Leif Walsh
I filed an SPIP for this at
https://issues.apache.org/jira/browse/SPARK-24258. Let’s discuss!

On Wed, Apr 18, 2018 at 23:33 Leif Walsh <leif.wa...@gmail.com> wrote:

> I agree we should reuse as much as possible. For PySpark, I think the
> obvious choices already made (Breeze and numpy arrays) make a lot of
> sense; I’m not sure about the other language bindings and would defer to
> others.
>
> I was under the impression that UDTs were gone and (probably?) not coming
> back. Did I miss something and they’re actually going to be better
> supported in the future? I think your second point (about separating
> expanding the primitives from expanding SQL support) is only really true if
> we’re getting UDTs back.
>
> You’ve obviously seen more of the history here than me. Do you have a
> sense of why the efforts you mentioned never went anywhere? I don’t think
> this is strictly about “mllib local”, it’s more about generic linalg, so
> 19653 feels like the closest to what I’m after, but it looks to me like
> that one just fizzled out, rather than a real back and forth.
>
> Does this just need something like a persistent product manager to scope
> out the effort, champion it, and push it forward?
> On Wed, Apr 18, 2018 at 20:02 Joseph Bradley <jos...@databricks.com>
> wrote:
>
>> Thanks for the thoughts!  We've gone back and forth quite a bit about
>> local linear algebra support in Spark.  For reference, there have been some
>> discussions here:
>> https://issues.apache.org/jira/browse/SPARK-6442
>> https://issues.apache.org/jira/browse/SPARK-16365
>> https://issues.apache.org/jira/browse/SPARK-19653
>>
>> Overall, I like the idea of improving linear algebra support, especially
>> given the rise of Python numerical processing & deep learning.  But some
>> considerations I'd list include:
>> * There are great linear algebra libraries out there, and it would be
>> ideal to reuse those as much as possible.
>> * SQL support for linear algebra can be a separate effort from expanding
>> linear algebra primitives.
>> * It would be valuable to discuss external types as UDTs (which can be
>> hacked with numpy and scipy types now) vs. adding linear algebra types to
>> native Spark SQL.
>>
>>
>> On Wed, Apr 11, 2018 at 7:53 PM, Leif Walsh <leif.wa...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I’ve been playing around with the Vector and Matrix UDTs in pyspark.ml and
>>> I’ve found myself wanting more.
>>>
>>> There is a minor issue in that with the arrow serialization enabled,
>>> these types don’t serialize properly in python UDF calls or in toPandas.
>>> There’s a natural representation for them in numpy.ndarray, and I’ve
>>> started a conversation with the arrow community about supporting
>>> tensor-valued columns, but that might be a ways out. In the meantime, I
>>> think we can fix this by using the FixedSizeBinary column type in arrow,
>>> together with some metadata describing the tensor shape (list of dimension
>>> sizes).
>>>
>>> The larger issue, for which I intend to submit an SPIP soon, is that
>>> these types could be better supported at the API layer, regardless of
>>> serialization. In the limit, we could consider the entire numpy ndarray
>>> surface area as a target. At the minimum, what I’m thinking is that these
>>> types should support column operations like matrix multiply, transpose,
>>> inner and outer product, etc., and maybe have a more ergonomic construction
>>> API like df.withColumn(‘feature’, Vectors.of(‘list’, ‘of’, ‘cols’)), the
>>> VectorAssembler API is kind of clunky.
>>>
>>> One possibility here is to restrict the tensor column types such that
>>> every value must have the same shape, e.g. a 2x2 matrix. This would allow
>>> for operations to check validity before execution, for example, a matrix
>>> multiply could check dimension match and fail fast. However, there might be
>>> use cases for a column to contain variable shape tensors, I’m open to
>>> discussion here.
>>>
>>> What do you all think?
>>> --
>>> --
>>> Cheers,
>>> Leif
>>>
>>
>>
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>> http://databricks.com/
>>
> --
> --
> Cheers,
> Leif
>
-- 
-- 
Cheers,
Leif


Re: Possible SPIP to improve matrix and vector column type support

2018-04-18 Thread Leif Walsh
I agree we should reuse as much as possible. For PySpark, I think the
obvious choices already made (Breeze and numpy arrays) make a lot of
sense; I’m not sure about the other language bindings and would defer to
others.

I was under the impression that UDTs were gone and (probably?) not coming
back. Did I miss something and they’re actually going to be better
supported in the future? I think your second point (about separating
expanding the primitives from expanding SQL support) is only really true if
we’re getting UDTs back.

You’ve obviously seen more of the history here than me. Do you have a sense
of why the efforts you mentioned never went anywhere? I don’t think this is
strictly about “mllib local”, it’s more about generic linalg, so 19653
feels like the closest to what I’m after, but it looks to me like that one
just fizzled out, rather than a real back and forth.

Does this just need something like a persistent product manager to scope
out the effort, champion it, and push it forward?
On Wed, Apr 18, 2018 at 20:02 Joseph Bradley <jos...@databricks.com> wrote:

> Thanks for the thoughts!  We've gone back and forth quite a bit about
> local linear algebra support in Spark.  For reference, there have been some
> discussions here:
> https://issues.apache.org/jira/browse/SPARK-6442
> https://issues.apache.org/jira/browse/SPARK-16365
> https://issues.apache.org/jira/browse/SPARK-19653
>
> Overall, I like the idea of improving linear algebra support, especially
> given the rise of Python numerical processing & deep learning.  But some
> considerations I'd list include:
> * There are great linear algebra libraries out there, and it would be
> ideal to reuse those as much as possible.
> * SQL support for linear algebra can be a separate effort from expanding
> linear algebra primitives.
> * It would be valuable to discuss external types as UDTs (which can be
> hacked with numpy and scipy types now) vs. adding linear algebra types to
> native Spark SQL.
>
>
> On Wed, Apr 11, 2018 at 7:53 PM, Leif Walsh <leif.wa...@gmail.com> wrote:
>
>> Hi all,
>>
>> I’ve been playing around with the Vector and Matrix UDTs in pyspark.ml and
>> I’ve found myself wanting more.
>>
>> There is a minor issue in that with the arrow serialization enabled,
>> these types don’t serialize properly in python UDF calls or in toPandas.
>> There’s a natural representation for them in numpy.ndarray, and I’ve
>> started a conversation with the arrow community about supporting
>> tensor-valued columns, but that might be a ways out. In the meantime, I
>> think we can fix this by using the FixedSizeBinary column type in arrow,
>> together with some metadata describing the tensor shape (list of dimension
>> sizes).
>>
>> The larger issue, for which I intend to submit an SPIP soon, is that
>> these types could be better supported at the API layer, regardless of
>> serialization. In the limit, we could consider the entire numpy ndarray
>> surface area as a target. At the minimum, what I’m thinking is that these
>> types should support column operations like matrix multiply, transpose,
>> inner and outer product, etc., and maybe have a more ergonomic construction
>> API like df.withColumn(‘feature’, Vectors.of(‘list’, ‘of’, ‘cols’)), the
>> VectorAssembler API is kind of clunky.
>>
>> One possibility here is to restrict the tensor column types such that
>> every value must have the same shape, e.g. a 2x2 matrix. This would allow
>> for operations to check validity before execution, for example, a matrix
>> multiply could check dimension match and fail fast. However, there might be
>> use cases for a column to contain variable shape tensors, I’m open to
>> discussion here.
>>
>> What do you all think?
>> --
>> --
>> Cheers,
>> Leif
>>
>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> http://databricks.com/
>
-- 
-- 
Cheers,
Leif


Possible SPIP to improve matrix and vector column type support

2018-04-11 Thread Leif Walsh
Hi all,

I’ve been playing around with the Vector and Matrix UDTs in pyspark.ml and
I’ve found myself wanting more.

There is a minor issue in that with the arrow serialization enabled, these
types don’t serialize properly in python UDF calls or in toPandas. There’s
a natural representation for them in numpy.ndarray, and I’ve started a
conversation with the arrow community about supporting tensor-valued
columns, but that might be a ways out. In the meantime, I think we can fix
this by using the FixedSizeBinary column type in arrow, together with some
metadata describing the tensor shape (list of dimension sizes).
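
As a sketch of that idea using pyarrow directly (illustrative only, not wired
into Spark's arrow serialization path):

    import numpy as np
    import pyarrow as pa

    shape, dtype = (2, 2), np.float64
    width = int(np.prod(shape)) * np.dtype(dtype).itemsize  # 32 bytes per 2x2 float64

    matrices = [np.arange(4, dtype=dtype).reshape(shape) for _ in range(3)]
    values = pa.array([m.tobytes() for m in matrices], type=pa.binary(width))

    # The shape and element type travel alongside the column as field metadata.
    field = pa.field("matrix", pa.binary(width),
                     metadata={b"shape": b"2,2", b"dtype": b"float64"})

    # Reading back: reinterpret the fixed-width bytes using the recorded shape.
    first = np.frombuffer(values[0].as_py(), dtype=dtype).reshape(shape)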

The larger issue, for which I intend to submit an SPIP soon, is that these
types could be better supported at the API layer, regardless of
serialization. In the limit, we could consider the entire numpy ndarray
surface area as a target. At the minimum, what I’m thinking is that these
types should support column operations like matrix multiply, transpose,
inner and outer product, etc., and maybe have a more ergonomic construction
API like df.withColumn(‘feature’, Vectors.of(‘list’, ‘of’, ‘cols’)), the
VectorAssembler API is kind of clunky.
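
For comparison, today's construction path versus the hypothetical ergonomic
form (column names are made up; Vectors.of does not exist today):

    from pyspark.ml.feature import VectorAssembler

    # Today's construction path:
    assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="feature")
    assembled = assembler.transform(df)

    # The ergonomic form sketched above would be roughly:
    #   df.withColumn("feature", Vectors.of("x1", "x2", "x3"))   # hypothetical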

One possibility here is to restrict the tensor column types such that every
value must have the same shape, e.g. a 2x2 matrix. This would allow for
operations to check validity before execution, for example, a matrix
multiply could check dimension match and fail fast. However, there might be
use cases for a column to contain variable shape tensors, I’m open to
discussion here.
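
A tiny sketch of the kind of fail-fast check a fixed-shape column type would
enable (hypothetical helper, not an existing API):

    def matmul_result_shape(a_shape, b_shape):
        # Validate before launching any distributed work.
        if a_shape[1] != b_shape[0]:
            raise ValueError("cannot multiply a %dx%d matrix by a %dx%d matrix"
                             % (a_shape + b_shape))
        return (a_shape[0], b_shape[1])

    matmul_result_shape((2, 3), (3, 5))  # (2, 5)
    matmul_result_shape((2, 3), (4, 5))  # raises ValueError immediately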

What do you all think?
-- 
-- 
Cheers,
Leif