Hmm - the schema = myschema argument doesn't seem to work in SparkR in my simple local test. I'm filing a JIRA for this now.
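(For reference, a minimal sketch of such a local test; the cars.csv path, columns, and types are hypothetical, and it assumes the spark-csv package is on the classpath:

  library(SparkR)
  sc <- sparkR.init(master = "local")
  sqlContext <- sparkRSQL.init(sc)

  # Hypothetical two-column schema for a small CSV file.
  myschema <- structType(structField("id", "integer"),
                         structField("name", "string"))
  df <- read.df(sqlContext, "cars.csv",
                source = "com.databricks.spark.csv", schema = myschema)
  # Expected the types from myschema; the schema argument instead
  # appears to be ignored and every column comes back as string.
  printSchema(df)
)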
On Wed, Jun 3, 2015 at 11:04 AM, Eskilson,Aleksander <alek.eskil...@cerner.com> wrote:

> Neat, thanks for the info Hossein. My use case was just to reset the
> schema for a CSV dataset, but if either (a) I can specify it at load, or
> (b) it will be inferred in the future, I'll likely not need to cast
> columns, much less reset the whole schema. I'll still file a JIRA for the
> capability, but with lower priority.
>
> -Alek
>
> From: Hossein Falaki <hoss...@databricks.com>
> Date: Wednesday, June 3, 2015 at 12:55 PM
> To: "shiva...@eecs.berkeley.edu" <shiva...@eecs.berkeley.edu>
> Cc: Aleksander Eskilson <alek.eskil...@cerner.com>, "dev@spark.apache.org" <dev@spark.apache.org>
> Subject: Re: SparkR DataFrame Column Casts esp. from CSV Files
>
> Yes, spark-csv does not infer types yet, but that is planned to be
> implemented soon.
>
> To work around the current limitations (of spark-csv and SparkR), you
> can specify the schema in read.df() to get your desired types from
> spark-csv. For example:
>
> myschema <- structType(structField("id", "integer"),
>                        structField("name", "string"),
>                        structField("location", "string"))
> df <- read.df(sqlContext, "path/to/file.csv",
>               source = "com.databricks.spark.csv", schema = myschema)
>
> -Hossein
>
> On Jun 3, 2015, at 10:29 AM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote:
>
> cc Hossein, who knows more about the spark-csv options.
>
> You are right that the default CSV reader options end up creating all
> columns as string. I know that the JSON reader infers the schema [1], but
> I don't know if the CSV reader has any options to do that. Regarding the
> SparkR syntax to cast columns, I think there is a simpler way to do it by
> just assigning to the same column name. For example, I have a flights
> DataFrame with the `year` column typed as string. To cast it to int, I
> just use
>
> flights$year <- cast(flights$year, "int")
>
> Now the DataFrame has the same number of columns as before, and you don't
> need a selection.
>
> However, this still doesn't address the part about casting multiple
> columns -- could you file a new JIRA to track the need for casting
> multiple columns, or rather for being able to set the schema after
> loading a DataFrame?
>
> Thanks
> Shivaram
>
> [1] http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>
> On Wed, Jun 3, 2015 at 7:51 AM, Eskilson,Aleksander <alek.eskil...@cerner.com> wrote:
>
>> It appears that casting columns remains a bit of a trick in Spark's
>> DataFrames. This is an issue because tools like spark-csv will set
>> column types to String by default and will not attempt to infer types.
>> Although spark-csv supports specifying types for columns in its options,
>> it's not clear how that might be integrated into SparkR (when loading
>> the spark-csv package into the R session).
>>
>> Looking at the column.R spec, we can cast a column to a different data
>> type with the cast function [1], but notably cast is not a mutator: it
>> returns a Column object rather than a DataFrame. It appears a column
>> cast can only be 'applied' by using withColumn() or mutate() (an alias
>> for withColumn()).
>>
>> The other way to cast with Spark DataFrames is to write UDFs that
>> operate on a column value and return a coerced value. It looks like
>> SparkR doesn't have UDFs just yet [2], but it seems they'd be necessary
>> for a natural one-off column cast in R, something like
>>
>> df.col1toInt <- withColumn(df, "intCol1",
>>                            udf(df$col1, function(x) as.numeric(x)))
>>
>> (where col1 was originally of 'character' type).
>>
>> Currently it seems one has to do
>>
>> df.col1cast <- cast(df$col1, "int")
>> df.col1toInt <- withColumn(df, "intCol1", df.col1cast)
>>
>> and if we wanted just our casted columns and not the original columns
>> from the data frame, we'd still have to do a select. There was a
>> conversation about CSV files just yesterday: types are already
>> problematic, and CSV is a very common data source in R, even at scale.
>>
>> But only being able to coerce one column at a time is really unwieldy.
>> Can the current spark-csv SQL API for specifying types [3] be extended
>> to SparkR? And are there any thoughts on implementing some kind of type
>> inference, perhaps based on sampling some number of rows (an
>> implementation I've seen before)? R's read.csv() and read.delim() get
>> types by inferring from the whole file. Something that achieves that
>> functionality, via explicit definition of types or via sampling, will
>> probably be necessary to work with CSV files that have enough columns
>> to merit R at Spark's scale.
>>
>> Regards,
>> Alek Eskilson
>>
>> [1] https://github.com/apache/spark/blob/master/R/pkg/R/column.R#L190
>> [2] https://issues.apache.org/jira/browse/SPARK-6817
>> [3] https://github.com/databricks/spark-csv#sql-api
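(Given Shivaram's assignment idiom above, a hedged sketch of the multi-column workaround as it stands today; the flights column names and target types are hypothetical:

  # One assignment per column, using the cast() idiom from the thread:
  flights$year  <- cast(flights$year, "int")
  flights$month <- cast(flights$month, "int")
  flights$delay <- cast(flights$delay, "double")

  # Or keep only the casted columns by rebuilding with select() + alias():
  casted <- select(flights,
                   alias(cast(flights$year, "int"), "year"),
                   alias(cast(flights$month, "int"), "month"),
                   alias(cast(flights$delay, "double"), "delay"))
  printSchema(casted)
)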
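(On the spark-csv SQL API question, one possibility, sketched here but not verified in this thread, is reaching that API from SparkR through sql() and declaring column types in the temporary table's DDL per [3]; the table name, path, and columns are hypothetical:

  # Declare column types in DDL via the spark-csv SQL API [3].
  sql(sqlContext, "CREATE TEMPORARY TABLE cars (yearMade double, carMake string)
                   USING com.databricks.spark.csv
                   OPTIONS (path 'cars.csv', header 'true')")
  carsDF <- sql(sqlContext, "SELECT * FROM cars")
  printSchema(carsDF)
)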