That's a very large vector. Is it sparse? Perhaps you'd have better luck performing an aggregate instead of a pivot, and assembling the vector using a UDF.
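
For example, something along these lines might work. This is only a rough sketch, not tested at your scale; it assumes attribute_id values are already dense integers in [0, ~130M) that can be used directly as vector indices, and the names num_attributes, to_sparse_vector and entity_vectors are illustrative, not from your code:

from pyspark.sql import functions as F
from pyspark.ml.linalg import SparseVector, VectorUDT

num_attributes = 130_000_000  # vector dimensionality, one slot per attribute_id

def to_sparse_vector(pairs):
    # pairs is the list of (attribute_id, event_count) structs for one entity
    pairs = sorted(pairs, key=lambda p: p[0])  # SparseVector requires sorted indices
    indices = [int(p[0]) for p in pairs]
    values = [float(p[1]) for p in pairs]
    return SparseVector(num_attributes, indices, values)

to_sparse_vector_udf = F.udf(to_sparse_vector, VectorUDT())

entity_vectors = (
    long_frame.groupBy("entity_id")
    .agg(F.collect_list(F.struct("attribute_id", "event_count")).alias("pairs"))
    .withColumn("features", to_sparse_vector_udf("pairs"))
    .drop("pairs")
)

This keeps each entity's row sparse rather than materializing ~130 million columns. If your attribute_id values aren't contiguous, you'd want to map them to dense indices first (e.g. with StringIndexer or a join against a lookup table).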
On Thu, Oct 29, 2020 at 10:19 PM Daniel Chalef <daniel.cha...@sparkpost.com.invalid> wrote:

> Hello,
>
> I have a very large long-format dataframe (several billion rows) that I'd
> like to pivot and vectorize (using the VectorAssembler), with the aim to
> reduce dimensionality using something akin to TF-IDF. Once pivoted, the
> dataframe will have ~130 million columns.
>
> The source, long-format schema looks as follows:
>
> root
>  |-- entity_id: long (nullable = false)
>  |-- attribute_id: long (nullable = false)
>  |-- event_count: integer (nullable = true)
>
> Pivoting as per the following fails, exhausting executor and driver
> memory. I am unsure whether increasing memory limits would be successful
> here as my sense is that pivoting and then using a VectorAssembler isn't
> the right approach to solving this problem.
>
> wide_frame = (
>     long_frame.groupBy("entity_id")
>     .pivot("attribute_id")
>     .agg(F.first("event_count"))
> )
>
> Are there other Spark patterns that I should attempt in order to achieve
> my end goal of a vector of attributes for every entity?
>
> Thanks, Daniel

--
Patrick McCarthy
Senior Data Scientist, Machine Learning Engineering
Dstillery
470 Park Ave South, 17th Floor, NYC 10016