[jira] [Commented] (SPARK-6189) Pandas to DataFrame conversion should check field names for periods

mgdadv (JIRA) Thu, 12 Mar 2015 02:59:31 -0700

    [ https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357610#comment-14357610 ]


mgdadv commented on SPARK-6189:
-------------------------------

While the dot is legal in R and SQL, I don't think there is a nice way of
making it legal in Python. So at least in the Spark Python code, I think
something should be done about it.

I just realized that automatic renaming can cause problems if the new name
already exists. For example, what if GNP_deflator is already in the data set
and GNP.deflator is then renamed to GNP_deflator?
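
To make the collision concrete, here is a minimal sketch in plain Python. It
is not part of the patch; the function name find_rename_collisions and the
choice of "_" as the replacement character are assumptions made only for
illustration.

```python
def find_rename_collisions(columns, replacement="_"):
    """Map each post-rename name to the original names that would share it.

    Only names claimed by more than one original column are returned.
    (Hypothetical helper, not taken from the Spark patch.)
    """
    targets = {}
    for name in columns:
        new_name = name.replace(".", replacement)
        targets.setdefault(new_name, []).append(name)
    return {new: olds for new, olds in targets.items() if len(olds) > 1}


# "GNP.deflator" would be renamed onto the existing "GNP_deflator" column.
collisions = find_rename_collisions(["GNP.deflator", "GNP_deflator", "GNP"])
print(collisions)
```

An empty result would mean the renaming is safe for that set of column names.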

I think the best thing to do is to just warn the user by printing out a warning
message. I have changed the patch accordingly.
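
Roughly, the check could look like the sketch below. This is not the actual
patch, only an illustration of the idea; the function name
warn_on_dotted_columns and the wording of the message are assumptions.

```python
import warnings


def warn_on_dotted_columns(columns):
    """Warn if any column name contains a period; return the offending names.

    The names themselves are left unchanged. (Hypothetical helper, not the
    code from the Spark patch.)
    """
    dotted = [str(name) for name in columns if "." in str(name)]
    if dotted:
        warnings.warn(
            "Column names containing '.' cannot be selected via attribute "
            "access: " + ", ".join(dotted)
        )
    return dotted


# With the column names from the example CSV, only "a.b" is reported.
dotted = warn_on_dotted_columns(["a.b", "a", "c"])
print(dotted)
```

Since nothing is renamed, a name clash like the one described above cannot
happen; the user just gets told which columns need string-based access.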

Here is some example code for pyspark:

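import StringIO  # Python 2 module; in Python 3 use io.StringIO instead
import pandas as pd

# sqlCtx is assumed to be the SQL context predefined by the pyspark shell
df = pd.read_csv(StringIO.StringIO("a.b,a,c\n101,102,103\n201,202,203"))
spdf = sqlCtx.createDataFrame(df)
spdf.take(2)
spdf[spdf.a == 102].take(2)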

So far this works, but the following fails, because the period is treated
as selecting a field b within column a:

spdf[spdf.a.b == 101].take(2)

In pandas, df.a.b doesn't work either, but the field can be accessed via the
string "a.b", i.e.:

df["a.b"]


> Pandas to DataFrame conversion should check field names for periods
> -------------------------------------------------------------------
>
>                 Key: SPARK-6189
>                 URL: https://issues.apache.org/jira/browse/SPARK-6189
>             Project: Spark
>          Issue Type: Improvement
>          Components: DataFrame, SQL
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> Issue I ran into:  I imported an R dataset in CSV format into a Pandas 
> DataFrame and then used toDF() to convert that into a Spark DataFrame.  The R 
> dataset had a column with a period in it (column "GNP.deflator" in the 
> "longley" dataset).  When I tried to select it using the Spark DataFrame DSL, 
> I could not because the DSL thought the period was selecting a field within 
> GNP.
> Also, since "GNP" is another field's name, it gives an error which could be 
> obscure to users, complaining:
> {code}
> org.apache.spark.sql.AnalysisException: GetField is not valid on fields of 
> type DoubleType;
> {code}
> We should either handle periods in column names or check during loading and 
> warn/fail gracefully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
