Re: Helper methods for PySpark discussion

2018-10-27 Thread Leif Walsh
…a note in the docstring: >> I am not sure about this because forcing evaluation could be something that has side effects. For example, df.count() can realize a cache, and if we implement __len__ to call df.count() then len(df) would end up populating some cache…
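
A minimal sketch of the side effect being discussed, using a hypothetical monkey-patch (this is not Spark's API; spark.range and the cache() call are just for illustration):

```python
# Hypothetical: if __len__ were backed by count(), len(df) would run an
# action, and an action also populates any pending cache() as a side effect.
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.getOrCreate()

def _len(self):
    # count() launches a full Spark job; on a cached-but-unmaterialized
    # DataFrame it also fills the cache, which a caller may not expect
    # from a plain len() call.
    return self.count()

DataFrame.__len__ = _len          # illustration only, not proposed as-is

df = spark.range(1000).cache()    # cache() is lazy; nothing is stored yet
n = len(df)                       # triggers a job and materializes the cache
```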

Re: Helper methods for PySpark discussion

2018-10-26 Thread Leif Walsh
That all sounds reasonable but I think in the case of 4 and maybe also 3 I would rather see it implemented to raise an error message that explains what’s going on and suggests the explicit operation that would do the most equivalent thing. And perhaps raise a warning (using the warnings module)
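
A minimal sketch of the "raise with guidance" idea, again as hypothetical monkey-patches rather than anything agreed in the thread; the exact wording and which dunders are covered are assumptions:

```python
# Hypothetical: refuse to run a distributed job implicitly and point the
# user at the explicit call; warnings.warn covers the softer variant.
import warnings
from pyspark.sql import DataFrame

def _len(self):
    raise TypeError(
        "len() on a distributed DataFrame would trigger a full count job; "
        "call df.count() explicitly if that is what you intend."
    )

def _bool(self):
    warnings.warn(
        "Truth-value checks on a DataFrame run df.count() under the hood; "
        "prefer an explicit df.count() > 0.",
        UserWarning,
    )
    return self.count() > 0

DataFrame.__len__ = _len       # illustration only
DataFrame.__bool__ = _bool     # illustration only
```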

Re: Python friendly API for Spark 3.0

2018-09-17 Thread Leif Walsh
I agree with Reynold, at some point you’re going to run into the parts of the pandas API that aren’t distributable. More feature parity will be good, but users are still eventually going to hit a feature cliff. Moreover, it’s not just the pandas API that people want to use, but also the set of…

Re: Python friendly API for Spark 3.0

2018-09-15 Thread Leif Walsh
Hey there, Here’s something I proposed recently that’s in this space. https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-24258 It’s motivated by working with a user who wanted to do some custom statistics for which they could write the numpy code, and knew in what dimensions they…
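
The preview cuts off, but a sketch of how such custom numpy statistics are usually expressed today is a grouped-map pandas UDF (Spark 2.3+). The column names and the median-absolute-deviation statistic below are illustrative, not taken from the thread:

```python
# Assumed workflow: hand each group to plain numpy via a grouped-map
# pandas UDF; requires pyarrow on the workers.
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 7.0)], ["key", "value"]
)

@pandas_udf("key string, mad double", PandasUDFType.GROUPED_MAP)
def median_abs_dev(pdf):
    # Each group arrives as a pandas DataFrame; the statistic is pure numpy.
    v = pdf["value"].values
    mad = float(np.median(np.abs(v - np.median(v))))
    return pd.DataFrame({"key": [pdf["key"].iloc[0]], "mad": [mad]})

df.groupby("key").apply(median_abs_dev).show()
```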

Re: Revisiting Online serving of Spark models?

2018-05-22 Thread Leif Walsh
I’m with you on json being more readable than parquet, but we’ve had success using pyarrow’s parquet reader and have been quite happy with it so far. If your target is python (and probably if not now, then soon, R), you should look into it. On Mon, May 21, 2018 at 16:52 Joseph Bradley…
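
For context, a minimal sketch of the pyarrow Parquet path mentioned here; the model directory path is made up, and the data/ layout is just the usual place Spark ML writers put their Parquet part files:

```python
# Read Spark-written Parquet with pyarrow for a lightweight, non-JVM
# serving process. The path below is hypothetical.
import pyarrow.parquet as pq

table = pq.read_table("my_model/data")   # directory of Parquet part files
params = table.to_pandas()
print(params.head())
```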

Re: Possible SPIP to improve matrix and vector column type support

2018-05-12 Thread Leif Walsh
I filed an SPIP for this at https://issues.apache.org/jira/browse/SPARK-24258. Let’s discuss! On Wed, Apr 18, 2018 at 23:33 Leif Walsh <leif.wa...@gmail.com> wrote: > I agree we should reuse as much as possible. For PySpark, I think the obvious choices of Breeze and numpy arrays…

Re: Possible SPIP to improve matrix and vector column type support

2018-04-18 Thread Leif Walsh
…> can be a separate effort from expanding linear algebra primitives. > * It would be valuable to discuss external types as UDTs (which can be hacked with numpy and scipy types now) vs. adding linear algebra types to native Spark SQL. > On Wed, Apr 11, 2018 at 7:53 PM, …

Possible SPIP to improve matrix and vector column type support

2018-04-11 Thread Leif Walsh
Hi all, I’ve been playing around with the Vector and Matrix UDTs in pyspark.ml and I’ve found myself wanting more. There is a minor issue in that, with Arrow serialization enabled, these types don’t serialize properly in Python UDF calls or in toPandas. There’s a natural representation for…
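
A minimal sketch of the round trip being described, assuming a Spark 2.3/2.4-era session; the column names are illustrative, and the comment about fallback behavior reflects that Arrow had no mapping for UDT columns at the time:

```python
# A pyspark.ml vector column through toPandas with Arrow enabled. Because
# VectorUDT has no Arrow mapping, the conversion either falls back to the
# slower non-Arrow path (with a warning) or errors, depending on settings.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.createDataFrame(
    [(1, Vectors.dense([1.0, 2.0, 3.0])),
     (2, Vectors.dense([4.0, 5.0, 6.0]))],
    ["id", "features"],
)

pdf = df.toPandas()   # UDT column prevents the Arrow-optimized conversion
print(pdf.dtypes)
```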