It appears that casting columns remains a bit of a trick in Spark’s DataFrames. This is an issue because tools like spark-csv will set column types to String by default and will not attempt to infer types. Although spark-csv supports specifying types for columns in its options, it’s not clear how that might be integrated into SparkR (when loading the spark-csv package into the R session).
Looking at the column.R spec we can cast a column to a different data type with the cast function [1], but it’s notable that this is not a mutator, and it returns a column object as opposed to a DataFrame. It appears the column cast can only be ‘applied’ by using the withColumn() or mutate() (an alias for withColumn). The other way to cast with Spark DataFrames is to write UDFs that operate on a column value and return a coerced value. It looks like SparkR doesn’t have UDFs just yet [2], but it seems like they’d be necessary to do a natural one-off column cast in R, something like df.col1toInt <- withColumn(df, “intCol1”, udf(df$col1, function(x) as.numeric(x))) (where col1 was originally ‘character’ type) Currently it seems one has to df.col1cast <- cast(df$col1, “int”) df.col1toInt <- withColumn(df, df.col1cast) If we wanted just our casted columns and not the original column from the data frame, we’d still have to do a select. There was a conversation about CSV files just yesterday. Types are already problematic, but they’re a very common data source in R, even at scale. But only being able to coerce one column at a time is really unwieldy. Can the current spark-csv SQL API for specifying types [3] be extended SparkR? And are there any thoughts on implementing some kind of type inferencing perhaps based on a sampling of some number of rows (an implementation I’ve seen before)? R’s read.csv() and read.delim() get types by inferring from the whole file. Getting something that can achieve that functionality via explicit definition of types or sampling will probably be necessary to work with CSV files that have enough columns to merit R at Spark’s scale. Regards, Alek Eskilson [1] - https://github.com/apache/spark/blob/master/R/pkg/R/column.R#L190 [2] - https://issues.apache.org/jira/browse/SPARK-6817 [3] - https://github.com/databricks/spark-csv#sql-api CONFIDENTIALITY NOTICE This message and any included attachments are from Cerner Corporation and are intended only for the addressee. The information contained in this message is confidential and may constitute inside or non-public information under international, federal, or state securities laws. Unauthorized forwarding, printing, copying, distribution, or use of such information is strictly prohibited and may be unlawful. If you are not the addressee, please promptly delete this message and notify the sender of the delivery error by e-mail or you may call Cerner's corporate offices in Kansas City, Missouri, U.S.A at (+1) (816)221-1024.