Re: Proposed additional function to create fold_column for better integration of Spark data frames with H2O

2022-01-11 Thread Chester Gan
Another possible workaround, when building an ML pipeline with PySpark and H2O's Python API, is to first convert the PySpark dataframes to H2O dataframes and then do the following: 1. Create a new dataframe from the feature dataframe using drop_duplicates (call this group_df), with

Re: Proposed additional function to create fold_column for better integration of Spark data frames with H2O

2022-01-11 Thread Chester Gan
Has anyone else used PySpark dataframes in conjunction with H2O for ML pipelining and had to use custom folds to keep rows/observations of the same group (e.g. user account, vehicle, city) in the same validation fold, so as to prevent data leakage during cross-validation? On

Proposed additional function to create fold_column for better integration of Spark data frames with H2O

2022-01-06 Thread Chester Gan
Idea: PySpark function to create fold indices (numbers from 0, ..., N-1, where N := number of folds needed for k-fold CV during auto ML training) on train & test datasets
```
# train & test are PySpark dataframes of the train & test datasets respectively
import pyspark.sql.functions as F
from
```
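
The core of the idea discussed in this thread (fold indices 0, ..., N-1, with all rows of one group landing in the same fold) can be illustrated without Spark or H2O. The following is only a minimal sketch in plain Python, not the function proposed in the original post: the helper names `fold_for_group` and `add_fold_column`, the `account` field, and the toy rows are all hypothetical, and hashing the group key is just one assumed way to get a stable group-to-fold mapping.

```python
import hashlib


def fold_for_group(group_key, n_folds):
    """Map a group key to a fold index in 0..n_folds-1 (hypothetical helper).

    md5 is used only as a stable hash, so the same key always maps to
    the same fold across runs and machines.
    """
    digest = hashlib.md5(str(group_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_folds


def add_fold_column(rows, group_field, n_folds, fold_field="fold"):
    """Return copies of the row dicts with an added fold index field."""
    return [
        {**row, fold_field: fold_for_group(row[group_field], n_folds)}
        for row in rows
    ]


# Hypothetical toy data: several observations per account.
rows = [
    {"account": "a1", "x": 1.0},
    {"account": "a1", "x": 1.5},
    {"account": "a2", "x": 0.3},
    {"account": "a3", "x": 2.2},
    {"account": "a3", "x": 2.4},
    {"account": "a4", "x": 0.9},
]

folded = add_fold_column(rows, "account", n_folds=3)
for row in folded:
    print(row["account"], row["fold"])
```

Because the fold depends only on the group key, rows from one account can never be split across folds, which is the leakage concern raised in the follow-up. The trade-off is that a hash-based assignment does not guarantee equally sized folds, or even that every fold is non-empty when there are few groups, so a real pipeline would probably want to check the fold sizes before training.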