[ https://issues.apache.org/jira/browse/SPARK-6432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371039#comment-14371039 ]
Cheng Lian commented on SPARK-6432:
-----------------------------------

The problem is this: if all partition columns that appear in the path also exist in the data files, loading works fine. But if only some of the partition columns exist in the data files, we end up with duplicated columns. Your case belongs to the first category.

> Cannot load parquet data with partitions if not all partition columns match data columns
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-6432
>                 URL: https://issues.apache.org/jira/browse/SPARK-6432
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0, 1.3.1
>            Reporter: Jianshi Huang
>            Assignee: Cheng Lian
>
> Suppose we have a dataset in the following folder structure:
> {noformat}
> parquet/source=live/date=2015-03-18/
> parquet/source=live/date=2015-03-19/
> ...
> {noformat}
> And the data schema has the following columns:
> - id
> - *event_date*
> - source
> - value
> Here the partition key source matches the data column source, but the
> partition key date doesn't match any column in the data.
> Then we cannot load the dataset in Spark using parquetFile. It reports:
> {code}
> org.apache.spark.sql.AnalysisException: Ambiguous references to source: (source#2,List()),(source#5,List());
> ...
> {code}
> Currently, if partition columns overlap with data columns, the partition
> columns have to be a subset of the data columns.
> Jianshi
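
For illustration only, the two categories described in the comment can be modeled without Spark. The following is a minimal Python sketch based solely on the description above; the function names (schema_as_described, merged_schema, duplicates) are hypothetical and are not Spark APIs, and the sketch is not taken from the Spark source. It shows how the schema can end up with two source columns when only some of the partition columns exist in the data, and one possible way to avoid that by appending only the missing partition columns.

```python
def schema_as_described(data_cols, partition_cols):
    """Model of the reported behaviour (an assumption, not Spark code):
    partition columns are skipped only when ALL of them already
    exist in the data files; otherwise all of them are appended."""
    if all(c in data_cols for c in partition_cols):
        return list(data_cols)
    return list(data_cols) + list(partition_cols)


def merged_schema(data_cols, partition_cols):
    """One possible alternative (hypothetical): append only the
    partition columns that are missing from the data columns."""
    return list(data_cols) + [c for c in partition_cols if c not in data_cols]


def duplicates(cols):
    """Return column names that occur more than once, in first-seen order."""
    seen, dups = set(), []
    for c in cols:
        if c in seen and c not in dups:
            dups.append(c)
        seen.add(c)
    return dups


# Columns from the issue: the data has source but not date.
data_cols = ["id", "event_date", "source", "value"]
partition_cols = ["source", "date"]

reported = schema_as_described(data_cols, partition_cols)
print(duplicates(reported))                      # ['source']
print(merged_schema(data_cols, partition_cols))  # no duplicated column
```

With partition_cols = ["source"] alone (the first category), schema_as_described returns the data columns unchanged, which matches the case that loads fine.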