Hi Abhi,

In SparkR glm, categorical features (columns of type string) are one-hot encoded automatically. Pre-processing such as `as.factor` is therefore not necessary; you can feed your data directly to model training.
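
To make this concrete, here is a minimal sketch of reading a CSV into a Spark DataFrame and fitting a GLM on it without any `as.factor` step. It assumes Spark 2.x or later (where `sparkR.session()`, the built-in `csv` source and `spark.glm` are available); on 1.6 the session setup and CSV reader differ. The toy data, the column names `y`, `x` and `city`, and the temp file are made up for illustration only.

```r
library(SparkR)

# Assumption: Spark 2.x+ with SparkR on the library path, running locally.
sparkR.session(master = "local[*]")

# Hypothetical toy data, written to a temp CSV so the sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
toy <- data.frame(
  y    = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  x    = c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0),
  city = c("a", "b", "c", "a", "b", "c"),
  stringsAsFactors = FALSE
)
write.csv(toy, csv_path, row.names = FALSE)

# Read the CSV straight into a Spark DataFrame; `city` stays a string column.
df <- read.df(csv_path, source = "csv", header = "true", inferSchema = "true")
printSchema(df)

# The string column is used directly in the formula; no as.factor() needed.
model <- spark.glm(df, y ~ x + city, family = "gaussian")
summary(model)

sparkR.session.stop()
```

The summary should list separate coefficients for the levels of `city`, which is the automatic encoding at work. Numeric columns that are really category codes would need to be cast to string first if you want them treated as categorical.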

Thanks
Yanbo

2016-05-30 2:06 GMT-07:00 Abhishek Anand <abhis.anan...@gmail.com>:

> Hi,
>
> I want to run the glm variant of SparkR on my data, which is in a CSV
> file.
>
> I see that the glm function in SparkR takes a Spark DataFrame as input.
>
> When I read a CSV file and create a Spark DataFrame from it, how should I
> handle the factor variables/columns in my data?
>
> Do I need to convert it to an R data frame, convert the columns to factors
> using as.factor, create a Spark DataFrame again, and then run glm over it?
>
> Running as.factor over a big dataset is not possible.
>
> Please suggest the best way to achieve this.
>
> What pre-processing should be done, and what is the best way to do it?
>
> Thanks,
> Abhi