Hi Franklyn,

I had the same problem as yours with vectors & maps. I tried:
1) UDF --> cumbersome and difficult to maintain. One has to rewrite /
re-implement UDFs, provide extensive docs for colleagues, and something
weird may happen when you migrate to a new Spark version (a minimal
sketch follows below)
2) RDD / DataFrame row-by-row approach --> slow, but reliable &
understandable, and you can fill in & modify rows as you like (also
sketched below)
3) Spark SQL / DataFrame "magic tricks" approach --> like the code you
provided: create DataFrames, join them, drop & modify columns as needed.
Generally speaking, it should be faster than approach #2
4) Encoders --> raw approach; basic types only. I chose this approach in
the end, but I had to reformulate my problem in terms of basic data types
and POJO classes
5) Catalyst API Expressions --> I believe this is the right way, but it
requires that you & your colleagues dive deep into Spark internals
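
To illustrate approach #1, a minimal sketch (assuming a SparkSession named
spark and a single vector column 'feature' with a fixed size of 10, as in
your example -- adjust to your schema):

from pyspark.ml.linalg import SparseVector, VectorUDT
from pyspark.sql import functions as F

# Replace nulls with an empty sparse vector of the expected size.
fill_vector = F.udf(lambda v: v if v is not None else SparseVector(10, {}),
                    VectorUDT())
df = df.withColumn('feature', fill_vector('feature'))

And approach #2 under the same assumptions, going through the RDD API:

from pyspark.sql import Row

def fill_row(row):
    # Copy the row, substituting an empty vector for a null.
    d = row.asDict()
    if d['feature'] is None:
        d['feature'] = SparseVector(10, {})
    return Row(**d)

df = spark.createDataFrame(df.rdd.map(fill_row), df.schema)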

Sincerely yours, Timur

On Fri, Jun 23, 2017 at 12:55 PM, Franklyn D'souza <
franklyn.dso...@shopify.com> wrote:

> For reference, this is what is required to coalesce a vector column in
> pyspark.
>
> from pyspark.ml.linalg import SparseVector, VectorUDT
> from pyspark.sql import functions as F
> from pyspark.sql.types import StructField, StructType
>
> schema = StructType([StructField('feature', VectorUDT(), True)])
> df = spark.createDataFrame(
>     [(SparseVector(10, {1: 44}),), (None,), (SparseVector(10, {1: 23}),),
>      (None,), (SparseVector(10, {1: 35}),)], schema=schema)
> # Hold the empty vector in a one-row frame, under column '_empty_vector'.
> empty_vector = spark.createDataFrame(
>     [(SparseVector(10, {}),)],
>     schema=StructType([StructField('_empty_vector', VectorUDT(), True)]))
> df = df.crossJoin(empty_vector)
> df = df.withColumn('feature', F.coalesce('feature', '_empty_vector'))
>
>
>
> On Thu, Jun 22, 2017 at 11:54 AM, Franklyn D'souza <
> franklyn.dso...@shopify.com> wrote:
>
>> We've developed Scala UDFs internally to address some of these issues, and
>> we'd love to upstream them back to Spark. Just trying to figure out what
>> vector support looks like on the roadmap.
>>
>> Would it be best to put this functionality into the Imputer or
>> VectorAssembler, or maybe to give vectors more first-class support in
>> DataFrames by having them work with the lit column expression?
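>>
>> To make the ask concrete, first-class support might look roughly like
>> this (hypothetical -- today lit() does not accept vectors, which is
>> exactly the gap):
>>
>> from pyspark.ml.linalg import Vectors
>> import pyspark.sql.functions as F
>>
>> df = df.withColumn('feature',
>>                    F.coalesce(df['feature'], F.lit(Vectors.sparse(10, {}))))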
>>
>> On Wed, Jun 21, 2017 at 9:30 PM, Franklyn D'souza <
>> franklyn.dso...@shopify.com> wrote:
>>
>>> From the documentation it states that `The input columns should be of
>>> DoubleType or FloatType.`, so I don't think that is what I'm looking for.
>>> Also, in general, the API around vectors is sorely lacking, especially on
>>> the pyspark side.
>>>
>>> Very common vector operations like addition, subtraction, and dot
>>> products can't be performed. I'm wondering what the direction is for
>>> vector support in Spark.
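>>>
>>> For now the only route I know of is a Python UDF; e.g. a dot product
>>> sketch (hypothetical columns 'a' and 'b' holding same-sized vectors):
>>>
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.types import DoubleType
>>>
>>> # u.dot(v) returns a numpy float; cast so it serializes as DoubleType.
>>> dot = F.udf(lambda u, v: float(u.dot(v)), DoubleType())
>>> df = df.withColumn('a_dot_b', dot('a', 'b'))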
>>>
>>> On Wed, Jun 21, 2017 at 9:19 PM, Maciej Szymkiewicz <
>>> mszymkiew...@gmail.com> wrote:
>>>
>>>> Since 2.2 there is Imputer:
>>>>
>>>> https://github.com/apache/spark/blob/branch-2.2/examples/src/main/python/ml/imputer_example.py
>>>>
>>>> which should at least partially address the problem.
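>>>>
>>>> A minimal usage sketch along the lines of that example (hypothetical
>>>> column names; by default Imputer replaces NaN/null in numeric columns
>>>> with the column mean):
>>>>
>>>> from pyspark.ml.feature import Imputer
>>>>
>>>> df = spark.createDataFrame(
>>>>     [(1.0, float("nan")), (2.0, 3.0), (float("nan"), 4.0)], ["a", "b"])
>>>> imputer = Imputer(inputCols=["a", "b"], outputCols=["out_a", "out_b"])
>>>> imputer.fit(df).transform(df).show()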
>>>>
>>>> On 06/22/2017 03:03 AM, Franklyn D'souza wrote:
>>>> > I just wanted to highlight some of the rough edges around using
>>>> > vectors in DataFrame columns.
>>>> >
>>>> > If there is a null in a DataFrame column containing vectors, pyspark
>>>> > ml models like logistic regression will fail completely.
>>>> >
>>>> > However, from what I've read there is no good way to fill in these
>>>> > nulls with empty vectors.
>>>> >
>>>> > It's not possible to create a literal vector column expression and
>>>> > coalesce it with the column from pyspark.
>>>> >
>>>> > So we're left with writing a Python UDF to do the coalesce; this
>>>> > is really inefficient on large datasets and becomes a bottleneck for
>>>> > ML pipelines working with real-world data.
>>>> >
>>>> > I'd like to know how other users are dealing with this and what plans
>>>> > there are to extend vector support for DataFrames.
>>>> >
>>>> > Thanks!
>>>> >
>>>> > Franklyn
>>>>
>>>
>>
>
