[ https://issues.apache.org/jira/browse/SPARK-6189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14357610#comment-14357610 ]
mgdadv commented on SPARK-6189:
-------------------------------

While the dot is legal in R and SQL, I don't think there is a nice way of making it legal in Python. So at least in the Spark Python code, I think something should be done about it.

I just realized that the automatic renaming can cause problems if that entry already exists. For example, what if GNP_deflator was already in the data set and then GNP.deflator gets renamed to it? I think the best thing to do is to just warn the user by printing out a warning message. I have changed the patch accordingly.

Here is some example code for pyspark:

    import StringIO  # Python 2 module
    import pandas as pd
    df = pd.read_csv(StringIO.StringIO("a.b,a,c\n101,102,103\n201,202,203"))
    spdf = sqlCtx.createDataFrame(df)
    spdf.take(2)
    spdf[spdf.a==102].take(2)

So far this works, but this fails:

    spdf[spdf.a.b==101].take(2)

In pandas df.a.b doesn't work either, but the fields can be accessed via the string "a.b", i.e.:

    df["a.b"]

> Pandas to DataFrame conversion should check field names for periods
> -------------------------------------------------------------------
>
>                 Key: SPARK-6189
>                 URL: https://issues.apache.org/jira/browse/SPARK-6189
>             Project: Spark
>          Issue Type: Improvement
>          Components: DataFrame, SQL
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> Issue I ran into: I imported an R dataset in CSV format into a Pandas
> DataFrame and then used toDF() to convert that into a Spark DataFrame. The R
> dataset had a column with a period in it (column "GNP.deflator" in the
> "longley" dataset). When I tried to select it using the Spark DataFrame DSL,
> I could not because the DSL thought the period was selecting a field within
> GNP.
> Also, since "GNP" is another field's name, it gives an error which could be
> obscure to users, complaining:
> {code}
> org.apache.spark.sql.AnalysisException: GetField is not valid on fields of
> type DoubleType;
> {code}
> We should either handle periods in column names or check during loading and
> warn/fail gracefully.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
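
For illustration, the check-and-warn idea from the comment above could look roughly like the following. This is only a sketch and not the actual patch: the function name check_field_names and the shape of its return value are hypothetical, and it works on a plain list of column names rather than on a Pandas or Spark DataFrame. It warns about every name that contains a period and reports whether the underscore variant already exists, which is the GNP.deflator / GNP_deflator collision mentioned in the comment.

```python
import warnings


def check_field_names(names):
    """Warn about column names that contain periods (hypothetical helper).

    Returns a list of (name, suggested, collides) tuples, one per offending
    name. `suggested` is the name with periods replaced by underscores, and
    `collides` is True if `suggested` is already one of the given names, so
    renaming automatically would clash with an existing column.
    """
    existing = set(names)
    flagged = []
    for name in names:
        if "." not in name:
            continue
        suggested = name.replace(".", "_")
        collides = suggested in existing
        message = "column name %r contains a period" % name
        if collides:
            message += "; %r already exists, so it cannot be renamed safely" % suggested
        else:
            message += "; consider renaming it to %r" % suggested
        # Warn only; the names themselves are left unchanged.
        warnings.warn(message)
        flagged.append((name, suggested, collides))
    return flagged


# Column names taken from the examples in this thread.
flagged = check_field_names(["GNP", "GNP.deflator", "GNP_deflator", "Year"])
```

Only warning, rather than renaming, keeps the original names intact, so a caller can still reach the column by its string name (as with df["a.b"] in pandas) and decide how to resolve any collision.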