Hi Franklyn,

I ran into the same problem as you with vectors & Maps. I tried:

1) UDFs --> cumbersome and difficult to maintain. You have to re-write /
re-implement UDFs, extensive docs have to be provided for colleagues, and
something weird may happen when you migrate to a new Spark version.

2) RDD / DataFrame row-by-row approach --> slow but reliable &
understandable, and you can fill in & modify rows as you like.

3) Spark SQL / DataFrame "magic tricks" approach --> like the code you
provided: create DataFrames, join them, drop & modify columns as needed.
Generally speaking, it should be faster than approach #2.

4) Encoders --> raw approach; basic types only. This is the approach I
finally chose, but I had to reformulate my problem in terms of basic data
types and POJO classes.

5) Catalyst API Expressions --> I believe this is the right way, but it
requires that you & your colleagues dive deep into Spark internals.
Sincerely yours,
Timur

On Fri, Jun 23, 2017 at 12:55 PM, Franklyn D'souza <
franklyn.dso...@shopify.com> wrote:

> As a reference, this is what is required to coalesce a vector column in
> pyspark:
>
> df = sc.sql.createDataFrame([(SparseVector(10,{1:44}),), (None,),
>     (SparseVector(10,{1:23}),), (None,), (SparseVector(10,{1:35}),)],
>     schema=schema)
> empty_vector = sc.sql.createDataFrame([(SparseVector(10, {}),)],
>     schema=schema)
> df = df.crossJoin(empty_vector)
> df = df.withColumn('feature', F.coalesce('feature', '_empty_vector'))
>
> On Thu, Jun 22, 2017 at 11:54 AM, Franklyn D'souza <
> franklyn.dso...@shopify.com> wrote:
>
>> We've developed Scala UDFs internally to address some of these issues, and
>> we'd love to upstream them back to Spark. We're just trying to figure out
>> what vector support looks like on the roadmap.
>>
>> Would it be best to put this functionality into the Imputer or
>> VectorAssembler, or maybe to give it more first-class support in
>> DataFrames by having it work with the lit column expression?
>>
>> On Wed, Jun 21, 2017 at 9:30 PM, Franklyn D'souza <
>> franklyn.dso...@shopify.com> wrote:
>>
>>> The documentation states that `The input columns should be of
>>> DoubleType or FloatType.`, so I don't think that is what I'm looking for.
>>> Also, in general the API around vectors is highly lacking, especially on
>>> the pyspark side.
>>>
>>> Very common vector operations like addition, subtraction and dot
>>> products can't be performed. I'm wondering what the direction is with
>>> vector support in Spark.
>>>
>>> On Wed, Jun 21, 2017 at 9:19 PM, Maciej Szymkiewicz <
>>> mszymkiew...@gmail.com> wrote:
>>>
>>>> Since 2.2 there is Imputer:
>>>>
>>>> https://github.com/apache/spark/blob/branch-2.2/examples/src/main/python/ml/imputer_example.py
>>>>
>>>> which should at least partially address the problem.
>>>>
>>>> On 06/22/2017 03:03 AM, Franklyn D'souza wrote:
>>>> > I just wanted to highlight some of the rough edges around using
>>>> > vectors in DataFrame columns.
>>>> >
>>>> > If there is a null in a DataFrame column containing vectors, pyspark
>>>> > ml models like logistic regression will completely fail.
>>>> >
>>>> > However, from what I've read there is no good way to fill in these
>>>> > nulls with empty vectors.
>>>> >
>>>> > It's not possible to create a literal vector column expression and
>>>> > coalesce it with the column from pyspark.
>>>> >
>>>> > So we're left with writing a Python UDF which does this coalesce; this
>>>> > is really inefficient on large datasets and becomes a bottleneck for
>>>> > ml pipelines working with real-world data.
>>>> >
>>>> > I'd like to know how other users are dealing with this and what plans
>>>> > there are to extend vector support for DataFrames.
>>>> >
>>>> > Thanks,
>>>> >
>>>> > Franklyn
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org