[jira] [Comment Edited] (SPARK-15522) DataFrame Column Names That are Numbers aren't referenced correctly in SQL
[ https://issues.apache.org/jira/browse/SPARK-15522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897638#comment-15897638 ] Hyukjin Kwon edited comment on SPARK-15522 at 3/6/17 11:26 PM: --- We can use backticks for it as below: {code} scala> Seq(Some(1), None).toDF("82").createOrReplaceTempView("piv_qhID") scala> spark.sql("select * from piv_qhID where '82' is NULL limit 20").show() +---+ | 82| +---+ +---+ scala> spark.sql("select * from piv_qhID where `82` is NULL limit 20").show() ++ | 82| ++ |null| ++ {code} It seems {{'82'}} or {{"82"}} was being created as a constant. I am resolving this JIRA. Please reopen this if I misunderstood. was (Author: hyukjin.kwon): We can use backticks for it as below: {code} scala> spark.range(10).toDF("82").createOrReplaceTempView("piv_qhID") scala> spark.sql("select * from piv_qhID where `82` is NULL limit 20").show() +---+ | 82| +---+ +---+ {code} It seems {{'82'}} or {{"82"}} was being created as a constant. I am resolving this JIRA. Please reopen this if I misunderstood. > DataFrame Column Names That are Numbers aren't referenced correctly in SQL > -- > > Key: SPARK-15522 > URL: https://issues.apache.org/jira/browse/SPARK-15522 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Jason Pohl > > The following code is run: > val pre_piv_df_a = sqlContext.sql(""" > SELECT > CASE WHEN Gender = 'M' Then 1 ELSE 0 END AS has_male, > CASE WHEN Gender = 'F' Then 1 ELSE 0 END AS has_female, > CAST(StartAge AS Double) AS StartAge_dbl, > CAST(EndAge AS Double) AS EndAge_dbl, > * > FROM alldem_union_curr > """) > .withColumn("JavaStartTimestamp", create_ts($"StartTimestamp")) > .drop("StartTimestamp").withColumnRenamed("JavaStartTimestamp", > "StartTimestamp") > .drop("StartAge").drop("EndAge") > .withColumnRenamed("StartAge_dbl", > "StartAge").withColumnRenamed("EndAge_dbl", "EndAge") > val pre_piv_df_b = pre_piv_df_a > .withColumn("media_month_cc", media_month_cc($"MediaMonth")) > .withColumn("media_week_cc", media_week_sts_cc($"StartTimestamp")) > .withColumn("media_day_cc", media_day_sts_cc($"StartTimestamp")) > .withColumn("week_day", week_day($"StartTimestamp")) > .withColumn("week_end", week_end($"StartTimestamp")) > .join(sqlContext.table("cad_nets"), $"Network" === $"nielsen_network", > "inner") > .withColumnRenamed("cad_network", "norm_net_code_a") > .withColumn("norm_net_code", reCodeNets($"norm_net_code_a")) > pre_piv_df_b.registerTempTable("pre_piv_df") > val piv_qhID_df = pre_piv_df_b.groupBy("Network", "Audience", "StartDate", > "rating_category_cd") > .pivot("qaID").agg("rating" -> "mean") > The pivot creates a lot of columns (96) with names that are like > ‘01’,’02’,…,’96’ as a result of pivoting a table that has quarter hour IDs. > In the below SQL the highlighted section causes problems. If I rename the > columns to ‘col01’,’col02’,…,’col96’ I can run the SQL correctly and get the > expected results. > select * from piv_qhID where 82 is NULL limit 20 > And I am getting no rows even though there are nulls. > On the other hand the query: > select * from piv_qhID where 82 is NOT NULL limit 20 > Returns all rows (even those with nulls) > Renaming the columns fixes this, but it would be nice if the columns were > referenced correctly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15522) DataFrame Column Names That are Numbers aren't referenced correctly in SQL
[ https://issues.apache.org/jira/browse/SPARK-15522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15300346#comment-15300346 ] Huaxin Gao edited comment on SPARK-15522 at 5/25/16 4:25 PM: - I think it's not legal to have solely digit as the column name unless it's quoted. Could you please try select * from piv_qhID where "82" is NULL limit 20 or select * from piv_qhID where '82' is NULL limit 20? I think select * from piv_qhID where 82 is NOT NULL limit 20 will always return all the rows because 82 is NOT NULL is always true. I believe 82 is treated as a number, not a column name. was (Author: huaxing): I think it's not legal to have solely digit as the column name unless it's quoted. Could you please try **select * from piv_qhID where "82" is NULL limit 20** or ** select * from piv_qhID where '82' is NULL limit 20**? I think **select * from piv_qhID where 82 is NOT NULL limit 20** will always return all the rows because **82 is NOT NULL** is always true. I believe 82 is treated as a number, not a column name. > DataFrame Column Names That are Numbers aren't referenced correctly in SQL > -- > > Key: SPARK-15522 > URL: https://issues.apache.org/jira/browse/SPARK-15522 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Jason Pohl > > The following code is run: > val pre_piv_df_a = sqlContext.sql(""" > SELECT > CASE WHEN Gender = 'M' Then 1 ELSE 0 END AS has_male, > CASE WHEN Gender = 'F' Then 1 ELSE 0 END AS has_female, > CAST(StartAge AS Double) AS StartAge_dbl, > CAST(EndAge AS Double) AS EndAge_dbl, > * > FROM alldem_union_curr > """) > .withColumn("JavaStartTimestamp", create_ts($"StartTimestamp")) > .drop("StartTimestamp").withColumnRenamed("JavaStartTimestamp", > "StartTimestamp") > .drop("StartAge").drop("EndAge") > .withColumnRenamed("StartAge_dbl", > "StartAge").withColumnRenamed("EndAge_dbl", "EndAge") > val pre_piv_df_b = pre_piv_df_a > .withColumn("media_month_cc", media_month_cc($"MediaMonth")) > .withColumn("media_week_cc", media_week_sts_cc($"StartTimestamp")) > .withColumn("media_day_cc", media_day_sts_cc($"StartTimestamp")) > .withColumn("week_day", week_day($"StartTimestamp")) > .withColumn("week_end", week_end($"StartTimestamp")) > .join(sqlContext.table("cad_nets"), $"Network" === $"nielsen_network", > "inner") > .withColumnRenamed("cad_network", "norm_net_code_a") > .withColumn("norm_net_code", reCodeNets($"norm_net_code_a")) > pre_piv_df_b.registerTempTable("pre_piv_df") > val piv_qhID_df = pre_piv_df_b.groupBy("Network", "Audience", "StartDate", > "rating_category_cd") > .pivot("qaID").agg("rating" -> "mean") > The pivot creates a lot of columns (96) with names that are like > ‘01’,’02’,…,’96’ as a result of pivoting a table that has quarter hour IDs. > In the below SQL the highlighted section causes problems. If I rename the > columns to ‘col01’,’col02’,…,’col96’ I can run the SQL correctly and get the > expected results. > select * from piv_qhID where 82 is NULL limit 20 > And I am getting no rows even though there are nulls. > On the other hand the query: > select * from piv_qhID where 82 is NOT NULL limit 20 > Returns all rows (even those with nulls) > Renaming the columns fixes this, but it would be nice if the columns were > referenced correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org