I created https://issues.apache.org/jira/browse/SPARK-8085 for this.

On Wed, Jun 3, 2015 at 12:12 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> Hmm - the schema=myschema doesn't seem to work in SparkR from my simple
> local test. I'm filing a JIRA for this now
>
> On Wed, Jun 3, 2015 at 11:04 AM, Eskilson,Aleksander <
> alek.eskil...@cerner.com> wrote:
>
>>  Neat, thanks for the info, Hossein. My use case was just to reset the
>> schema for a CSV dataset, but if either (a) I can specify it at load, or (b)
>> it will be inferred in the future, I’ll likely not need to cast columns,
>> much less reset the whole schema. I’ll still file a JIRA for the
>> capability, but with lower priority.
>>
>>  —Alek
>>
>>   From: Hossein Falaki <hoss...@databricks.com>
>> Date: Wednesday, June 3, 2015 at 12:55 PM
>> To: "shiva...@eecs.berkeley.edu" <shiva...@eecs.berkeley.edu>
>> Cc: Aleksander Eskilson <alek.eskil...@cerner.com>, "dev@spark.apache.org"
>> <dev@spark.apache.org>
>> Subject: Re: SparkR DataFrame Column Casts esp. from CSV Files
>>
>>   Yes, spark-csv does not infer types yet, but type inference is planned to
>> be implemented soon.
>>
>>  To work around the current limitations (of spark-csv and SparkR), you
>> can specify the schema in read.df() to get your desired types from
>> spark-csv. For example:
>>
>>  myschema <- structType(structField("id", "integer"),
>> structField("name", "string"), structField("location", "string"))
>> df <- read.df(sqlContext, "path/to/file.csv", source =
>> "com.databricks.spark.csv", schema = myschema)
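>>
>>  If it helps, a quick sanity check after the load (plain SparkR calls,
>> assuming the read above succeeded):
>>
>> printSchema(df)
>> head(df)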
>>
>>  —Hossein
>>
>>  On Jun 3, 2015, at 10:29 AM, Shivaram Venkataraman <
>> shiva...@eecs.berkeley.edu> wrote:
>>
>>  cc Hossein who knows more about the spark-csv options
>>
>>  You are right that the default CSV reader options end up creating all
>> columns as string. I know that the JSON reader infers the schema [1] but I
>> don't know if the CSV reader has any options to do that. Regarding the
>> SparkR syntax to cast columns, I think there is a simpler way to do it by
>> just assigning to the same column name. For example, I have a flights
>> DataFrame with the `year` column typed as string. To cast it to int I just
>> use
>>
>>  flights$year <- cast(flights$year, "int")
>>
>>  Now the dataframe has the same number of columns as before and you
>> don't need a selection.
>>
>>  However, this still doesn't address the part about casting multiple
>> columns -- could you file a new JIRA to track the need for casting multiple
>> columns, or rather for being able to set the schema after loading a DataFrame?
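>>
>>  (In the meantime the same pattern can just be repeated per column -- a
>> rough sketch, where `month` and `distance` stand in for whatever other
>> columns need casting:
>>
>> flights$year <- cast(flights$year, "int")
>> flights$month <- cast(flights$month, "int")
>> flights$distance <- cast(flights$distance, "double")
>>
>>  but that clearly gets unwieldy for wide CSVs.)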
>>
>>  Thanks
>> Shivaram
>>
>>  [1]
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>>
>>
>> On Wed, Jun 3, 2015 at 7:51 AM, Eskilson,Aleksander <
>> alek.eskil...@cerner.com> wrote:
>>
>>>  It appears that casting columns remains a bit of a trick in Spark’s
>>> DataFrames. This is an issue because tools like spark-csv will set column
>>> types to String by default and will not attempt to infer types. Although
>>> spark-csv supports specifying types for columns in its options, it’s not
>>> clear how that might be integrated into SparkR (when loading the spark-csv
>>> package into the R session).
>>>
>>>  Looking at the column.R spec, we can cast a column to a different data
>>> type with the cast function [1], but it’s notable that this is not a
>>> mutator, and it returns a Column object as opposed to a DataFrame. It
>>> appears the column cast can only be ‘applied’ by using withColumn() or
>>> mutate() (an alias for withColumn()).
>>>
>>>  The other way to cast with Spark DataFrames is to write UDFs that
>>> operate on a column value and return a coerced value. It looks like SparkR
>>> doesn’t have UDFs just yet [2], but it seems like they’d be necessary to do
>>> a natural one-off column cast in R, something like
>>>
>>>  df.col1toInt <- withColumn(df, "intCol1", udf(df$col1, function(x)
>>> as.numeric(x)))
>>>
>>>  (where col1 was originally ‘character’ type)
>>>
>>>  Currently it seems one has to
>>> df.col1cast <- cast(df$col1, "int")
>>> df.col1toInt <- withColumn(df, "intCol1", df.col1cast)
>>>
>>>  If we wanted just our cast columns and not the original column from
>>> the data frame, we’d still have to do a select, something like the sketch
>>> below.
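>>>
>>>  This is a sketch only; intCol1 continues the example above, and col2
>>> stands in for whichever other columns we’d keep:
>>>
>>> df.final <- select(df.col1toInt, "intCol1", "col2")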
>>>
>>>  There was a conversation about CSV files just yesterday; types are
>>> already problematic there, but CSV files are a very common data source in
>>> R, even at scale.
>>>
>>>  But only being able to coerce one column at a time is really unwieldy.
>>> Can the current spark-csv SQL API for specifying types [3] be extended to
>>> SparkR? And are there any thoughts on implementing some kind of type
>>> inference, perhaps based on sampling some number of rows (an implementation
>>> I’ve seen before)? R’s read.csv() and read.delim() get types by inferring
>>> from the whole file. Something that achieves that, whether via explicit
>>> definition of types or via sampling, will probably be necessary to work
>>> with CSV files that have enough columns to merit R at Spark’s scale.
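>>>
>>>  For reference, the SQL API in [3] already lets you declare column types
>>> in the DDL. A rough, untested sketch of what driving that from SparkR’s
>>> existing sql() entry point might look like (the table and column names are
>>> just placeholders):
>>>
>>> sql(sqlContext,
>>>     "CREATE TEMPORARY TABLE cars (yearMade double, carMake string)
>>>      USING com.databricks.spark.csv
>>>      OPTIONS (path 'cars.csv', header 'true')")
>>> cars <- sql(sqlContext, "SELECT * FROM cars")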
>>>
>>>  Regards,
>>> Alek Eskilson
>>>
>>>  [1] - https://github.com/apache/spark/blob/master/R/pkg/R/column.R#L190
>>> [2] - https://issues.apache.org/jira/browse/SPARK-6817
>>> [3] - https://github.com/databricks/spark-csv#sql-api
>>>
>>>
>>
>>
>>
>
