Yes, the resulting matrix would be sparse. Thanks for the suggestion. Will
explore ways of doing this using an agg and UDF.
On Fri, Oct 30, 2020 at 6:26 AM Patrick McCarthy
wrote:
> That's a very large vector. Is it sparse? Perhaps you'd have better luck
> performing an aggregate instead of a pivot, and assembling the vector using
> a UDF.
>
> On Thu, Oct 29, 2020 at 10:19 PM Daniel Chalef
> wrote:
>
>> Hello,
>>
>> I have a very large long-format dataframe (several billion rows) that I'd
>> like to pivot and vectorize (using the VectorAssembler), with the aim to
>> reduce dimensionality using something akin to TF-IDF. Once pivoted, the
>> dataframe will have ~130 million columns.
>>
>> The source, long-format schema looks as follows:
>>
>> root
>> |-- entity_id: long (nullable = false)
>> |-- attribute_id: long (nullable = false)
>> |-- event_count: integer (nullable = true)
>>
>> Pivoting as per the following fails, exhausting executor and driver
>> memory. I am unsure whether increasing memory limits would be successful
>> here as my sense is that pivoting and then using a VectorAssembler isn't
>> the right approach to solving this problem.
>>
>> wide_frame = (
>> long_frame.groupBy("entity_id")
>> .pivot("attribute_id")
>> .agg(F.first("event_count"))
>> )
>>
>> Are there other Spark patterns that I should attempt in order to achieve
>> my end goal of a vector of attributes for every entity?
>>
>> Thanks, Daniel
>>
>
>
> --
>
>
> *Patrick McCarthy *
>
> Senior Data Scientist, Machine Learning Engineering
>
> Dstillery
>
> 470 Park Ave South, 17th Floor, NYC 10016
>