Re: Handling nulls in vector columns is non-trivial

2017-06-23 Thread Franklyn D'souza
=schema)
df = df.crossJoin(empty_vector)
df = df.withColumn('feature', F.coalesce('feature', '_empty_vector'))
On Thu, Jun 22, 2017 at 11:54 AM, Franklyn D'souza <franklyn.dso...@shopify.com> wrote: > We've developed Scala UDFs internally to address some of these issues and

Re: Handling nulls in vector columns is non-trivial

2017-06-22 Thread Franklyn D'souza
it more of a first class support in dataframes by having it work with the lit column expression. On Wed, Jun 21, 2017 at 9:30 PM, Franklyn D'souza < franklyn.dso...@shopify.com> wrote: > From the documentation it states that ` The input columns should be of > DoubleType or FloatType.` so

Re: Handling nulls in vector columns is non-trivial

2017-06-21 Thread Franklyn D'souza
es/src/main/python/ml/imputer_example.py > > which should at least partially address the problem. > > On 06/22/2017 03:03 AM, Franklyn D'souza wrote: > > I just wanted to highlight some of the rough edges around using > > vectors in columns in dataframes. > > > > If

Handling nulls in vector columns is non-trivial

2017-06-21 Thread Franklyn D'souza
I just wanted to highlight some of the rough edges around using vectors in columns in dataframes. If there is a null in a dataframe column containing vectors, pyspark ml models like logistic regression will completely fail. However, from what I've read there is no good way to fill in these nulls

Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-19 Thread Franklyn D'souza
-1. https://issues.apache.org/jira/browse/SPARK-18589 hasn't been resolved in this release and is a blocker to our adoption of Spark 2.0. I've updated the issue with some steps to reproduce the error. On Mon, Dec 19, 2016 at 4:37 AM, Sean Owen wrote: > PS, here are the open

Spark Assembly jar ?

2016-06-14 Thread Franklyn D'souza
Just wondering where the spark-assembly jar has gone in 2.0. I've been reading that it's been removed but I'm not sure what the new workflow is.

Can't use UDFs with Dataframes in spark-2.0-preview scala-2.10

2016-06-07 Thread Franklyn D'souza
I've built spark-2.0-preview (8f5a04b) with scala-2.10 using the following:

./dev/change-version-to-2.10.sh
./dev/make-distribution.sh -DskipTests -Dzookeeper.version=3.4.5 -Dcurator.version=2.4.0 -Dscala-2.10 -Phadoop-2.6 -Pyarn -Phive

and then ran the following code in a pyspark

Can't compile 2.0-preview with scala 2.10

2016-06-06 Thread Franklyn D'souza
Hi, I've checked out the 2.0-preview and attempted to build it with ./dev/make-distribution.sh -Pscala-2.10. However I keep getting:

[INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce-versions) @ spark-parent_2.11 ---
[WARNING] Rule 0: org.apache.maven.plugins.enforcer.BannedDependencies

Nulls getting converted to 0 with spark 2.0 SNAPSHOT

2016-03-07 Thread Franklyn D'souza
Just wanted to confirm that this is the expected behaviour. Basically I'm putting nulls into a non-nullable LongType column and doing a transformation operation on that column; the result is a column with nulls converted to 0. Here's an example:

from pyspark.sql import types
from pyspark.sql
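The mechanism behind this is plausibly the usual primitive-column one: a non-nullable LongType value lives in a zero-initialized primitive slot, and once the engine treats the column as non-nullable it stops consulting the null bits, so the slot's default 0 is what reads back. A plain-Python illustration of that idea (an analogy only, not Spark's actual code):

```python
from array import array

NULL = object()          # stand-in for a SQL null arriving from Python
values = [7, NULL, 42]

slots = array("q", [0] * len(values))     # 64-bit slots, zero-initialized
null_bits = [v is NULL for v in values]   # the nullability info a column tracks

for i, v in enumerate(values):
    if v is not NULL:
        slots[i] = v

# Declaring the column non-nullable amounts to ignoring null_bits on read,
# so the "missing" value surfaces as the slot's default, 0.
read_back = [slots[i] for i in range(len(values))]
print(read_back)  # -> [7, 0, 42]
```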

Operations on DataFrames with User Defined Types in pyspark

2016-02-11 Thread Franklyn D'souza
I'm using the UDT API to work with a custom Money datatype in dataframes. Here's how I have it set up:

class StringUDT(UserDefinedType):
    @classmethod
    def sqlType(self):
        return StringType()

    @classmethod
    def module(cls):
        return cls.__module__

    @classmethod