Dear list,

I am trying to run some regression models on a big data set using sparklyr. Some of the explanatory variables (Xs) in my model are categorical, so they have to be converted into dummy codes before the analysis. I understand that in Spark the columns need to be treated as string type and then passed through ft_one_hot_encoder to create the dummy codes. There are some discussions online, but I could not figure out how to write the code properly. Could you give me some suggestions, please? Thank you very much.
The code looks as follows:

> sc_mtcars %>% ft_string_indexer("gear", "gear1") %>% ft_one_hot_encoder("gear1", "gear2") %>% ml_linear_regression(hp ~ gear1 + wt)

Formula: hp ~ gear1 + wt

Coefficients:
(Intercept)       gear1          wt
  -78.38285    36.41416    62.17596

As you can see, it seems that ft_one_hot_encoder("gear1", "gear2") didn't work; otherwise there should be two coefficients for gear2. Any idea what went wrong?

One more thing: some earlier posts online show regression results with significance test information (standard errors and p-values). Is there any way to extract this information, or at least the standard errors, with the latest release of sparklyr?

Thank you very much.

Best regards,
YA.
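
P.S. To make clear what kind of output I am hoping for, here is a minimal local sketch in plain R (not sparklyr, and not a fix for the pipeline above). It assumes only base R's lm() and the built-in mtcars data; the name fit_local is purely illustrative. Treating gear as a factor gives two dummy coefficients, and the summary table includes standard errors and p-values:

```r
# Local reference only: plain R on the built-in mtcars data, no Spark involved.
# `fit_local` is an illustrative name, not part of the sparklyr pipeline.
fit_local <- lm(hp ~ factor(gear) + wt, data = mtcars)

# factor(gear) has three levels (3, 4, 5), so the default treatment coding
# yields two dummy coefficients in addition to the intercept and wt.
print(names(coef(fit_local)))

# The coefficient table holds the estimates together with standard errors,
# t values and p values, which is the information I would like from sparklyr.
coef_table <- summary(fit_local)$coefficients
print(colnames(coef_table))
```

This is the shape of result (one coefficient per non-reference gear level, plus significance information) that I would like to reproduce on the Spark side.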