Re: SparkR DataFrame Column Casts esp. from CSV Files

Eskilson,Aleksander Wed, 03 Jun 2015 10:50:49 -0700

Hi Shivaram,

As far as databricks’ spark-csv API shows, it seems there’s currently only 
support for explicit definition of column types. In JSON we have nice typed 
fields, but in CSVs, all bets are off. In the SQL version of the API, it 
appears you specify the column types when you create the table you’re 
populating with CSV data.


Thanks for the clarification on individual column casting, I was missing the 
more obvious syntax.

I’ll file a JIRA for resetting the schema after loading a DF.

Thanks,
Alek


From: Shivaram Venkataraman 
<[email protected]<mailto:[email protected]>>
Reply-To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Date: Wednesday, June 3, 2015 at 12:29 PM
To: Aleksander Eskilson 
<[email protected]<mailto:[email protected]>>
Cc: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>, 
"[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: Re: SparkR DataFrame Column Casts esp. from CSV Files

cc Hossein who knows more about the spark-csv options

You are right that the default CSV reader options end up creating all columns 
as string. I know that the JSON reader infers the schema [1] but I don't know 
if the CSV reader has any options to do that.  Regarding the SparkR syntax to 
cast columns, I think there is a simpler way to do it by just assigning to the 
same column name. For example I have a flights DataFrame with the `year` column 
typed as string. To cast it to int I just use

flights$year <- cast(flights$year, "int")

Now the dataframe has the same number of columns as before and you don't need a 
selection.

However this still doesn't address the part about casting multiple columns -- 
Could you file a new JIRA to track the need for casting multiple columns or 
rather being able to set the schema after loading a DF ?

Thanks
Shivaram

[1] 
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets<https://urldefense.proofpoint.com/v2/url?u=http-3A__spark.apache.org_docs_latest_sql-2Dprogramming-2Dguide.html-23json-2Ddatasets&d=AwMFaQ&c=NRtzTzKNaCCmhN_9N2YJR-XrNU1huIgYP99yDsEzaJo&r=0vZw1rBdgaYvDJYLyKglbrax9kvQfRPdzxLUyWSyxPM&m=aCZhOxAn5Iu762hWogwQK__JsZigsbLZFMaz44UcKQw&s=BX3MuobG748zhfm7hc_SnZA4MnFbwgFreNVEjkzkENc&e=>

On Wed, Jun 3, 2015 at 7:51 AM, Eskilson,Aleksander 
<[email protected]<mailto:[email protected]>> wrote:
It appears that casting columns remains a bit of a trick in Spark’s DataFrames. 
This is an issue because tools like spark-csv will set column types to String 
by default and will not attempt to infer types. Although spark-csv supports 
specifying  types for columns in its options, it’s not clear how that might be 
integrated into SparkR (when loading the spark-csv package into the R session).

Looking at the column.R spec we can cast a column to a different data type with 
the cast function [1], but it’s notable that this is not a mutator, and it 
returns a column object as opposed to a DataFrame. It appears the column cast 
can only be ‘applied’ by using the withColumn() or mutate() (an alias for 
withColumn).

The other way to cast with Spark DataFrames is to write UDFs that operate on a 
column value and return a coerced value. It looks like SparkR doesn’t have UDFs 
just yet [2], but it seems like they’d be necessary to do a natural one-off 
column cast in R, something like

df.col1toInt <- withColumn(df, “intCol1”, udf(df$col1, function(x) 
as.numeric(x)))

(where col1 was originally ‘character’ type)

Currently it seems one has to
df.col1cast <- cast(df$col1, “int”)
df.col1toInt <- withColumn(df, df.col1cast)

If we wanted just our casted columns and not the original column from the data 
frame, we’d still have to do a select. There was a conversation about CSV files 
just yesterday. Types are already problematic, but they’re a very common data 
source in R, even at scale.

But only being able to coerce one column at a time is really unwieldy. Can the 
current spark-csv SQL API for specifying types [3] be extended SparkR? And are 
there any thoughts on implementing some kind of type inferencing perhaps based 
on a sampling of some number of rows (an implementation I’ve seen before)? R’s 
read.csv() and read.delim() get types by inferring from the whole file. Getting 
something that can achieve that functionality via explicit definition of types 
or sampling will probably be necessary to work with CSV files that have enough 
columns to merit R at Spark’s scale.

Regards,
Alek Eskilson

[1] - 
https://github.com/apache/spark/blob/master/R/pkg/R/column.R#L190<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_apache_spark_blob_master_R_pkg_R_column.R-23L190&d=AwMFaQ&c=NRtzTzKNaCCmhN_9N2YJR-XrNU1huIgYP99yDsEzaJo&r=0vZw1rBdgaYvDJYLyKglbrax9kvQfRPdzxLUyWSyxPM&m=aCZhOxAn5Iu762hWogwQK__JsZigsbLZFMaz44UcKQw&s=pETagpDWepAmeaxucEKv1BgoCjqqpIejSjZhXZFF_y8&e=>
[2] - 
https://issues.apache.org/jira/browse/SPARK-6817<https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_SPARK-2D6817&d=AwMFaQ&c=NRtzTzKNaCCmhN_9N2YJR-XrNU1huIgYP99yDsEzaJo&r=0vZw1rBdgaYvDJYLyKglbrax9kvQfRPdzxLUyWSyxPM&m=aCZhOxAn5Iu762hWogwQK__JsZigsbLZFMaz44UcKQw&s=tiLELUAU2Sgk680gUGLr9fR9YxEU6lJEs2e0gWenWhs&e=>
[3] - 
https://github.com/databricks/spark-csv#sql-api<https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_databricks_spark-2Dcsv-23sql-2Dapi&d=AwMFaQ&c=NRtzTzKNaCCmhN_9N2YJR-XrNU1huIgYP99yDsEzaJo&r=0vZw1rBdgaYvDJYLyKglbrax9kvQfRPdzxLUyWSyxPM&m=aCZhOxAn5Iu762hWogwQK__JsZigsbLZFMaz44UcKQw&s=89QC5nymwl5GjjpMwUD--828WaTvjqik9glbCHR7T-8&e=>

CONFIDENTIALITY NOTICE This message and any included attachments are from 
Cerner Corporation and are intended only for the addressee. The information 
contained in this message is confidential and may constitute inside or 
non-public information under international, federal, or state securities laws. 
Unauthorized forwarding, printing, copying, distribution, or use of such 
information is strictly prohibited and may be unlawful. If you are not the 
addressee, please promptly delete this message and notify the sender of the 
delivery error by e-mail or you may call Cerner's corporate offices in Kansas 
City, Missouri, U.S.A at (+1) (816)221-1024<tel:%28%2B1%29%20%28816%29221-1024>.

Re: SparkR DataFrame Column Casts esp. from CSV Files

Reply via email to