Re: Proposed additional function to create fold_column for better integration of Spark data frames with H2O

2022-01-11 Thread Chester Gan
Another possible workaround, when creating an ML pipeline with PySpark and Python's H2O API, would be to first convert the PySpark dataframes to H2O dataframes, then do the following: 1. Create a new dataframe from the feature dataframe using drop_duplicates (call this group_df), with colum
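[Editor's note] The group_df idea above can be sanity-checked with a deterministic hash of the group key, so every row of a group lands in the same fold. A minimal plain-Python sketch (function name, 5-fold default, and string keys are illustrative assumptions, not from the post; in PySpark the same expression could be built from `crc32`/`pmod` column functions before converting to H2O):

```python
import zlib

def assign_fold(group_id: str, n_folds: int = 5) -> int:
    """Deterministic fold assignment: identical group ids always map to
    the same fold, so a group can never span two CV folds."""
    return zlib.crc32(group_id.encode("utf-8")) % n_folds

rows = ["user_a", "user_b", "user_a", "user_c"]
folds = [assign_fold(g) for g in rows]
```

Because the assignment depends only on the key, it survives repartitioning and recomputation, which is exactly what a fold_column needs.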

Re: Proposed additional function to create fold_column for better integration of Spark data frames with H2O

2022-01-11 Thread Chester Gan
Has anyone else used PySpark dataframes in conjunction with H2O for ML pipelining, and had to use custom folds to keep rows/observations of the same group (e.g. user account, vehicle, city, or whatever) in the same validation fold, so as to prevent data leakage during cross-validation? On Fri
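[Editor's note] The requirement described here ("all rows of a group in one fold") is what a group-aware splitter provides (scikit-learn calls this GroupKFold). A minimal pure-Python sketch of the idea, round-robining the unique groups across folds (names and fold count are illustrative):

```python
def group_folds(group_ids, n_folds=5):
    """Assign each unique group to one fold (round-robin), then map the
    assignment back to rows, so no group ever spans two folds."""
    fold_of = {g: i % n_folds
               for i, g in enumerate(dict.fromkeys(group_ids))}
    return [fold_of[g] for g in group_ids]

folds = group_folds(["car_1", "car_2", "car_1", "car_3", "car_2"], n_folds=2)
```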

Re: pyspark loop optimization

2022-01-11 Thread Gourav Sengupta
Hi, I am not sure what you are trying to achieve here: are cume_dist and percent_rank not different? If I am able to follow your question correctly, you are looking to filter out NULLs before applying the function on the dataframe, and I think it will be fine if you just create another dataframe
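[Editor's note] The two window functions do differ: `cume_dist` is the fraction of rows with a value less than or equal to the current one, while `percent_rank` is `(rank - 1) / (n - 1)`. A plain-Python illustration of both on a single partition, with NULLs filtered out first as the reply suggests (a sketch, not Spark's implementation):

```python
def cume_dist(values):
    # fraction of rows whose value is <= the current row's value
    n = len(values)
    s = sorted(values)
    return [sum(1 for y in s if y <= x) / n for x in values]

def percent_rank(values):
    # (rank - 1) / (n - 1); rank is the position of the first equal value
    n = len(values)
    s = sorted(values)
    return [s.index(x) / (n - 1) for x in values]

vals = [v for v in [1, 2, None, 2, 3] if v is not None]  # drop NULLs first
```

On `[1, 2, 2, 3]` the results diverge for the ties: cume_dist gives 0.75 for both 2s, percent_rank gives 1/3.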

Re: How to add a row number column without reordering my data frame

2022-01-11 Thread Andrew Davidson
Thanks! I will take a look Andy From: Gourav Sengupta Date: Tuesday, January 11, 2022 at 8:42 AM To: Andrew Davidson Cc: Andrew Davidson , "user @spark" Subject: Re: How to add a row number column without reordering my data frame Hi, I do not think we need to do any of that. Please try rep

[Spark ML Pipeline]: Error Loading Pipeline Model with Custom Transformer

2022-01-11 Thread Alana Young
I am experimenting with creating and persisting ML pipelines using custom transformers (I am using Spark 3.1.2). I was able to create a transformer class (for testing purposes, I modeled the code off the SQLTransformer class) and save the pipeline model. When I attempt to load the saved pipeline

Re: How to add a row number column without reordering my data frame

2022-01-11 Thread Gourav Sengupta
Hi, I do not think we need to do any of that. Please try repartitionByRange; Spark 3 has adaptive query execution with configurations to handle skew as well. Regards, Gourav On Tue, Jan 11, 2022 at 4:21 PM Andrew Davidson wrote: > Hi Gourav > > > > When I join I get OOM. To address this my thou
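[Editor's note] The suggestion above in code form, as a configuration sketch: it assumes an existing `SparkSession` named `spark`, and the partition count and join-key column are placeholders, not values from the thread.

```python
# Enable adaptive query execution and its skew-join handling (Spark 3.x)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Range-partition both sides on the join key before joining, so each
# partition holds a contiguous key range of manageable size
df = df.repartitionByRange(200, "id")
```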

Re: How to add a row number column without reordering my data frame

2022-01-11 Thread Andrew Davidson
Hi Gourav When I join I get OOM. To address this my thought was to split my tables into small batches of rows, join each batch, and then use union. My assumption is that the union is a narrow transform and as such requires fewer resources. Let's say I have 5 data frames I want to join toget
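[Editor's note] Outside Spark, the batch-then-union idea can be illustrated in plain Python (a sketch, not the poster's code): split both sides by `key % n_batches`, join each batch, then concatenate. In Spark the union is indeed narrow, but each per-batch join still shuffles its batch, so this mainly bounds the size of any single join rather than eliminating shuffles.

```python
def batched_join(left, right, n_batches=4):
    """left/right: lists of (key, value) pairs with integer keys.
    Join them batch by batch, then concatenate (the 'union')."""
    out = []
    for b in range(n_batches):
        left_b = [(k, v) for k, v in left if k % n_batches == b]
        right_b = {k: v for k, v in right if k % n_batches == b}
        out.extend((k, lv, right_b[k]) for k, lv in left_b if k in right_b)
    return out

joined = batched_join([(1, "a"), (2, "b"), (3, "c")], [(1, "x"), (3, "y")])
```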

Re: Difference in behavior for Spark 3.0 vs Spark 3.1 "create database"

2022-01-11 Thread Wenchen Fan
Hopefully, this StackOverflow answer can solve your problem: https://stackoverflow.com/questions/47523037/how-do-i-configure-pyspark-to-write-to-hdfs-by-default Spark doesn't control the behavior of qualifying paths. It's decided by certain configs and/or config files. On Tue, Jan 11, 2022 at 3:0
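[Editor's note] In code, the relevant knob is typically `fs.defaultFS`, set in `core-site.xml` or passed through Spark's Hadoop configuration. A configuration sketch; the namenode host and port are placeholders:

```python
from pyspark.sql import SparkSession

# Make unqualified paths resolve against HDFS rather than the local
# filesystem when Spark qualifies them
spark = (SparkSession.builder
         .config("spark.hadoop.fs.defaultFS", "hdfs://namenode:8020")
         .getOrCreate())
```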