Another possible workaround, when creating an ML pipeline with PySpark and
H2O's Python API, would be to first convert the PySpark dataframes to H2O
frames, then do the following:
1. Create a new dataframe from the feature dataframe using
drop_duplicates (call this group_df), with
Has anyone else used PySpark dataframes in conjunction with H2O for ML
pipelining and had to use custom folds to keep rows/observations of the
same group (e.g. user account, vehicle, city) in the same validation fold,
so as to prevent data leakage during cross-validation?
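For illustration, the group-aware fold idea can be sketched in plain Python (the data and names here are hypothetical; the actual pipeline would apply the same mapping to PySpark dataframe columns):

```python
# Hypothetical sketch of group-aware fold assignment: every row belonging
# to the same group (user account, vehicle, city, ...) receives the same
# fold index, so no group is split across training and validation folds.
N_FOLDS = 3

rows = [
    {"group": "user_a", "x": 1.0},
    {"group": "user_b", "x": 2.0},
    {"group": "user_a", "x": 3.0},
    {"group": "user_c", "x": 4.0},
]

# Map each distinct group key to a fold index in 0..N_FOLDS-1
fold_of = {g: i % N_FOLDS
           for i, g in enumerate(sorted({r["group"] for r in rows}))}
for r in rows:
    r["fold"] = fold_of[r["group"]]
```

Because the fold index is a deterministic function of the group key alone, both occurrences of `user_a` above land in the same fold.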
Idea: a PySpark function to create fold indices (numbers 0, ..., N-1,
where N := the number of folds needed for k-fold CV during AutoML training)
on the train & test datasets
```
# train & test are PySpark dataframes of the train & test datasets, respectively
import pyspark.sql.functions as F
from