Hi Qian,
The way I have gotten around this type of problem in the past is to do
a groupBy on the dimensions that you want to build a model for, and then
initialize and train a model using a package like scikit-learn for each
group in something like a grouped map pandas UDF. If you need these
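For concreteness, here is a minimal sketch of that pattern. It assumes the
data is already in a DataFrame named df with a user_id column, numeric
feature columns, and a label column; those names, the input path, and the
choice to return each pickled model plus a training score are placeholders
you would adapt to your actual schema.

    import pickle

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (BinaryType, DoubleType, StringType,
                                   StructField, StructType)
    from sklearn.tree import DecisionTreeClassifier

    spark = SparkSession.builder.getOrCreate()

    # Placeholder path for the 1TB dataset described below.
    df = spark.read.parquet("path/to/the/1tb/dataset")

    # One output row per user_id: the fitted model serialized with pickle
    # plus its accuracy on that user's own records. Adjust the types to
    # match your real user_id column.
    result_schema = StructType([
        StructField("user_id", StringType()),
        StructField("train_accuracy", DoubleType()),
        StructField("model", BinaryType()),
    ])

    def train_per_user(pdf: pd.DataFrame) -> pd.DataFrame:
        # Spark hands this function ALL rows for one user_id as a single
        # pandas DataFrame, so each ~1GB group must fit in executor memory.
        features = pdf.drop(columns=["user_id", "label"])  # "label" is a placeholder name
        labels = pdf["label"]
        model = DecisionTreeClassifier()
        model.fit(features, labels)
        return pd.DataFrame({
            "user_id": [pdf["user_id"].iloc[0]],
            "train_accuracy": [model.score(features, labels)],
            "model": [pickle.dumps(model)],
        })

    models = df.groupBy("user_id").applyInPandas(train_per_user, schema=result_schema)
    models.write.parquet("per_user_models")  # placeholder output path

Since applyInPandas materializes each whole group on a single executor, the
roughly 1GB per user mentioned below should be fine as long as the executors
have a few GB of memory headroom.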
I have a 1TB dataset with 100 columns. The first column is a user_id, and
there are about 1000 unique user_ids in this 1TB dataset.
The use case: I want to train an ML model for each user_id on that user's
records (approximately 1GB of records per user). Say the ML model is a
Decision Tree. But it is not