That's a very large vector. Is it sparse? Perhaps you'd have better luck performing an aggregate instead of a pivot, and assembling the vector using a UDF.
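
For example, something along these lines might work. This is only a rough sketch, not tested at your scale; it assumes attribute_id values are already dense integers in [0, ~130M) that can be used directly as vector indices, and the names num_attributes, to_sparse_vector and entity_vectors are illustrative, not from your code:

from pyspark.sql import functions as F
from pyspark.ml.linalg import SparseVector, VectorUDT

num_attributes = 130_000_000  # vector dimensionality, one slot per attribute_id

def to_sparse_vector(pairs):
    # pairs is the list of (attribute_id, event_count) structs for one entity
    pairs = sorted(pairs, key=lambda p: p[0])  # SparseVector requires sorted indices
    indices = [int(p[0]) for p in pairs]
    values = [float(p[1]) for p in pairs]
    return SparseVector(num_attributes, indices, values)

to_sparse_vector_udf = F.udf(to_sparse_vector, VectorUDT())

entity_vectors = (
    long_frame.groupBy("entity_id")
    .agg(F.collect_list(F.struct("attribute_id", "event_count")).alias("pairs"))
    .withColumn("features", to_sparse_vector_udf("pairs"))
    .drop("pairs")
)

This keeps each entity's row sparse rather than materializing ~130 million columns. If your attribute_id values aren't contiguous, you'd want to map them to dense indices first (e.g. with StringIndexer or a join against a lookup table).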
On Thu, Oct 29, 2020 at 10:19 PM Daniel Chalef <daniel.cha...@sparkpost.com.invalid> wrote:

> Hello,
>
> I have a very large long-format dataframe (several billion rows) that I'd
> like to pivot and vectorize (using the VectorAssembler), with the aim to
> reduce dimensionality using something akin to TF-IDF. Once pivoted, the
> dataframe will have ~130 million columns.
>
> The source, long-format schema looks as follows:
>
> root
>  |-- entity_id: long (nullable = false)
>  |-- attribute_id: long (nullable = false)
>  |-- event_count: integer (nullable = true)
>
> Pivoting as per the following fails, exhausting executor and driver
> memory. I am unsure whether increasing memory limits would be successful
> here as my sense is that pivoting and then using a VectorAssembler isn't
> the right approach to solving this problem.
>
> wide_frame = (
>     long_frame.groupBy("entity_id")
>     .pivot("attribute_id")
>     .agg(F.first("event_count"))
> )
>
> Are there other Spark patterns that I should attempt in order to achieve
> my end goal of a vector of attributes for every entity?
>
> Thanks, Daniel

--
Patrick McCarthy
Senior Data Scientist, Machine Learning Engineering
Dstillery
470 Park Ave South, 17th Floor, NYC 10016