Yes, spark-csv does not infer types yet, but that feature is planned.

To work around the current limitations (of spark-csv and SparkR), you can
specify the schema in read.df() to get your desired types from spark-csv.
For example:

myschema <- structType(structField("id", "integer"),
                       structField("name", "string"),
                       structField("location", "string"))
df <- read.df(sqlContext, "path/to/file.csv",
              source = "com.databricks.spark.csv",
              schema = myschema)
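If a DataFrame has already been loaded with all-string columns, the cast()
API discussed below can also be applied to several columns at once. A
minimal sketch (the column names and target types here are illustrative):

# Bulk-cast string columns, keeping the original column names.
# `df` is assumed to be an all-string DataFrame loaded by spark-csv.
targets <- c(id = "integer", name = "string", location = "string")
casted <- lapply(names(targets), function(n) {
  alias(cast(df[[n]], targets[[n]]), n)  # cast, then restore the name
})
dfTyped <- select(df, casted)
printSchema(dfTyped)                     # verify the resulting types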
—Hossein

> On Jun 3, 2015, at 10:29 AM, Shivaram Venkataraman
> <shiva...@eecs.berkeley.edu> wrote:
>
> cc Hossein, who knows more about the spark-csv options.
>
> You are right that the default CSV reader options end up creating all
> columns as strings. I know that the JSON reader infers the schema [1], but
> I don't know if the CSV reader has any options to do that. Regarding the
> SparkR syntax for casting columns, I think there is a simpler way to do
> it: just assign to the same column name. For example, I have a flights
> DataFrame with the `year` column typed as string. To cast it to int I
> just use
>
> flights$year <- cast(flights$year, "int")
>
> Now the DataFrame has the same number of columns as before, and you don't
> need a selection.
>
> However, this still doesn't address casting multiple columns -- could you
> file a new JIRA to track the need for casting multiple columns, or rather
> for being able to set the schema after loading a DataFrame?
>
> Thanks
> Shivaram
>
> [1] http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>
> On Wed, Jun 3, 2015 at 7:51 AM, Eskilson, Aleksander
> <alek.eskil...@cerner.com> wrote:
>
> It appears that casting columns remains a bit of a trick in Spark's
> DataFrames. This is an issue because tools like spark-csv set column
> types to String by default and do not attempt to infer types. Although
> spark-csv supports specifying types for columns in its options, it's not
> clear how that might be integrated into SparkR (when loading the
> spark-csv package into the R session).
>
> Looking at the column.R spec, we can cast a column to a different data
> type with the cast function [1], but it's notable that cast is not a
> mutator: it returns a Column object rather than a DataFrame. It appears
> the cast can only be 'applied' by using withColumn() or mutate() (an
> alias for withColumn()).
>
> The other way to cast with Spark DataFrames is to write UDFs that operate
> on a column value and return a coerced value. It looks like SparkR
> doesn't have UDFs just yet [2], but it seems they would be necessary to
> do a natural one-off column cast in R, something like
>
> df.col1toInt <- withColumn(df, "intCol1",
>                            udf(df$col1, function(x) as.numeric(x)))
>
> (where col1 was originally of 'character' type).
>
> Currently it seems one has to do
>
> df.col1cast <- cast(df$col1, "int")
> df.col1toInt <- withColumn(df, "intCol1", df.col1cast)
>
> If we wanted just the casted columns and not the original columns, we
> would still have to do a select. There was a conversation about CSV files
> just yesterday: types are already problematic, and CSV is a very common
> data source in R, even at scale.
>
> Only being able to coerce one column at a time is really unwieldy. Can
> the current spark-csv SQL API for specifying types [3] be extended to
> SparkR? And are there any thoughts on implementing some kind of type
> inference, perhaps based on a sampling of some number of rows (an
> implementation I've seen before)?
> R's read.csv() and read.delim() get their types by inferring over the
> whole file. Something that achieves that functionality, whether by
> explicit type definitions or by sampling, will probably be necessary for
> working with CSV files that have enough columns to merit R at Spark's
> scale.
>
> Regards,
> Alek Eskilson
>
> [1] https://github.com/apache/spark/blob/master/R/pkg/R/column.R#L190
> [2] https://issues.apache.org/jira/browse/SPARK-6817
> [3] https://github.com/databricks/spark-csv#sql-api
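On the sampling idea raised above: a rough way to approximate
read.csv()-style inference today is to sample the file locally with base R
and hand the inferred classes to spark-csv as an explicit schema. This is a
sketch, not a tested recipe; the path, sample size, header option, and type
mapping are all illustrative assumptions:

# Infer a spark-csv schema from a local sample of the file.
# Assumes a header row and that the first 1000 rows are representative.
sampled <- read.csv("path/to/file.csv", nrows = 1000,
                    stringsAsFactors = FALSE)
toSparkType <- function(cls) {
  switch(cls,
         integer = "integer",
         numeric = "double",
         logical = "boolean",
         "string")                 # anything else stays a string
}
fields <- lapply(names(sampled), function(n) {
  structField(n, toSparkType(class(sampled[[n]])))
})
schema <- do.call(structType, fields)
df <- read.df(sqlContext, "path/to/file.csv",
              source = "com.databricks.spark.csv",
              header = "true", schema = schema)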