Yes, spark-csv does not infer types yet, but that feature is planned.

To work around the current limitations (of spark-csv and SparkR), you can
specify the schema in read.df() to get your desired types from spark-csv.
For example:

myschema <- structType(structField("id", "integer"),
                       structField("name", "string"),
                       structField("location", "string"))
df <- read.df(sqlContext, "path/to/file.csv",
              source = "com.databricks.spark.csv",
              schema = myschema)
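If a DataFrame has already been loaded with all-string columns, the cast()
API discussed below can also be applied to several columns at once. A
minimal sketch (the column names and target types here are illustrative):

# Bulk-cast string columns, keeping the original column names.
# `df` is assumed to be an all-string DataFrame loaded by spark-csv.
targets <- c(id = "integer", name = "string", location = "string")
casted <- lapply(names(targets), function(n) {
  alias(cast(df[[n]], targets[[n]]), n)  # cast, then restore the name
})
dfTyped <- select(df, casted)
printSchema(dfTyped)                     # verify the resulting types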
—Hossein

> On Jun 3, 2015, at 10:29 AM, Shivaram Venkataraman
> <shiva...@eecs.berkeley.edu> wrote:
>
> cc Hossein, who knows more about the spark-csv options.
>
> You are right that the default CSV reader options end up creating all
> columns as strings. I know that the JSON reader infers the schema [1], but
> I don't know if the CSV reader has any options to do that. Regarding the
> SparkR syntax for casting columns, I think there is a simpler way to do
> it: just assign to the same column name. For example, I have a flights
> DataFrame with the `year` column typed as string. To cast it to int I
> just use
>
> flights$year <- cast(flights$year, "int")
>
> Now the DataFrame has the same number of columns as before, and you don't
> need a selection.
>
> However, this still doesn't address casting multiple columns -- could you
> file a new JIRA to track the need for casting multiple columns, or rather
> for being able to set the schema after loading a DataFrame?
>
> Thanks
> Shivaram
>
> [1] http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>
> On Wed, Jun 3, 2015 at 7:51 AM, Eskilson, Aleksander
> <alek.eskil...@cerner.com> wrote:
>
> It appears that casting columns remains a bit of a trick in Spark's
> DataFrames. This is an issue because tools like spark-csv set column
> types to String by default and do not attempt to infer types. Although
> spark-csv supports specifying types for columns in its options, it's not
> clear how that might be integrated into SparkR (when loading the
> spark-csv package into the R session).
>
> Looking at the column.R spec, we can cast a column to a different data
> type with the cast function [1], but it's notable that cast is not a
> mutator: it returns a Column object rather than a DataFrame. It appears
> the cast can only be 'applied' by using withColumn() or mutate() (an
> alias for withColumn()).
>
> The other way to cast with Spark DataFrames is to write UDFs that operate
> on a column value and return a coerced value. It looks like SparkR
> doesn't have UDFs just yet [2], but it seems they would be necessary to
> do a natural one-off column cast in R, something like
>
> df.col1toInt <- withColumn(df, "intCol1",
>                            udf(df$col1, function(x) as.numeric(x)))
>
> (where col1 was originally of 'character' type).
>
> Currently it seems one has to do
>
> df.col1cast <- cast(df$col1, "int")
> df.col1toInt <- withColumn(df, "intCol1", df.col1cast)
>
> If we wanted just the casted columns and not the original columns, we
> would still have to do a select. There was a conversation about CSV files
> just yesterday: types are already problematic, and CSV is a very common
> data source in R, even at scale.
>
> Only being able to coerce one column at a time is really unwieldy. Can
> the current spark-csv SQL API for specifying types [3] be extended to
> SparkR? And are there any thoughts on implementing some kind of type
> inference, perhaps based on a sampling of some number of rows (an
> implementation I've seen before)?
> R's read.csv() and read.delim() get their types by inferring over the
> whole file. Something that achieves that functionality, whether by
> explicit type definitions or by sampling, will probably be necessary for
> working with CSV files that have enough columns to merit R at Spark's
> scale.
>
> Regards,
> Alek Eskilson
>
> [1] https://github.com/apache/spark/blob/master/R/pkg/R/column.R#L190
> [2] https://issues.apache.org/jira/browse/SPARK-6817
> [3] https://github.com/databricks/spark-csv#sql-api
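On the sampling idea raised above: a rough way to approximate
read.csv()-style inference today is to sample the file locally with base R
and hand the inferred classes to spark-csv as an explicit schema. This is a
sketch, not a tested recipe; the path, sample size, header option, and type
mapping are all illustrative assumptions:

# Infer a spark-csv schema from a local sample of the file.
# Assumes a header row and that the first 1000 rows are representative.
sampled <- read.csv("path/to/file.csv", nrows = 1000,
                    stringsAsFactors = FALSE)
toSparkType <- function(cls) {
  switch(cls,
         integer = "integer",
         numeric = "double",
         logical = "boolean",
         "string")                 # anything else stays a string
}
fields <- lapply(names(sampled), function(n) {
  structField(n, toSparkType(class(sampled[[n]])))
})
schema <- do.call(structType, fields)
df <- read.df(sqlContext, "path/to/file.csv",
              source = "com.databricks.spark.csv",
              header = "true", schema = schema)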