Re: Train ML models on each partition

2019-05-09 Thread Dillon Dukek
Hi Qian, The way I have gotten around this type of problem in the past is to do a groupBy on the dimensions you want to build a model for, and then initialize and train a model for each group, using a package like scikit-learn, inside something like a grouped map pandas UDF. If you need these
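
A minimal sketch of the approach described in this reply, under stated assumptions: the feature columns are numeric, there is a label column (here hypothetically named `label`, which the thread does not mention), and `user_id` is an integer. It uses `applyInPandas`, the Spark 3.0+ form of a grouped map pandas UDF; on Spark 2.4 the same per-group function would be wrapped with `pandas_udf(..., PandasUDFType.GROUPED_MAP)` instead. The names `train_per_user`, `train_all`, `LABEL_COL`, and `RESULT_SCHEMA` are made up for illustration.

```python
import pickle

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical names: the thread only mentions the user_id column.
LABEL_COL = "label"
# Assumes integer user ids; adjust the type if they are strings.
RESULT_SCHEMA = "user_id long, n_rows long, model binary"


def train_per_user(pdf: pd.DataFrame) -> pd.DataFrame:
    """Train one decision tree on all rows that belong to a single user_id."""
    user_id = pdf["user_id"].iloc[0]
    features = pdf.drop(columns=["user_id", LABEL_COL])
    model = DecisionTreeClassifier(max_depth=5, random_state=0)
    model.fit(features, pdf[LABEL_COL])
    # Return one row per group: the key, the row count, and the pickled model.
    return pd.DataFrame(
        {
            "user_id": [user_id],
            "n_rows": [len(pdf)],
            "model": [pickle.dumps(model)],
        }
    )


def train_all(spark_df):
    """Run train_per_user once per user_id group of a Spark DataFrame."""
    return spark_df.groupBy("user_id").applyInPandas(
        train_per_user, schema=RESULT_SCHEMA
    )
```

Each group is handed to `train_per_user` as a single in-memory pandas DataFrame, so every user's records (roughly 1GB here) must fit in the memory of one executor. The result is a Spark DataFrame with one row per user, which could then be written out or collected to recover the models with `pickle.loads`.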

Train ML models on each partition

2019-05-08 Thread Qian He
I have a 1TB dataset with 100 columns. The first column is a user_id; there are about 1000 unique user_ids in this 1TB dataset. The use case: I want to train an ML model for each user_id on that user's records (approximately 1GB of records per user). Say the ML model is a Decision Tree. But it is not