[ https://issues.apache.org/jira/browse/SPARK-6432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371035#comment-14371035 ]
zzc commented on SPARK-6432:
----------------------------

[~huangjs], I have some Parquet files under partition paths, as follows:

2015-03-18 10:38 /zzc_test/parquetbypartitons/puser=impala.data1
2015-03-18 10:39 /zzc_test/parquetbypartitons/puser=impala.data2

The dataset loads correctly:

root
 |-- ltype: integer (nullable = false)
 |-- chan: string (nullable = false)
 |-- ts: integer (nullable = false)
 |-- cip: string (nullable = false)
 |-- rt: string (nullable = false)
 |-- date: string (nullable = false)
 |-- time: string (nullable = false)
 |-- host: string (nullable = false)
 |-- ratio: integer (nullable = false)
 |-- size: long (nullable = false)
 |-- code: integer (nullable = false)
 |-- dltime: long (nullable = false)
 |-- cache: string (nullable = false)
 |-- bsize: long (nullable = false)
 |-- upsize: long (nullable = false)
 |-- url: string (nullable = false)
 |-- referer: string (nullable = false)
 |-- ua: string (nullable = false)
 |-- *puser: string (nullable = true)*
 |-- pdate: string (nullable = true)
 |-- pslice: string (nullable = true)
 |-- pcache: integer (nullable = true)

> Cannot load parquet data with partitions if not all partition columns match data columns
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-6432
>                 URL: https://issues.apache.org/jira/browse/SPARK-6432
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.3.0, 1.3.1
>            Reporter: Jianshi Huang
>            Assignee: Cheng Lian
>
> Suppose we have a dataset in the following folder structure:
> {noformat}
> parquet/source=live/date=2015-03-18/
> parquet/source=live/date=2015-03-19/
> ...
> {noformat}
> And the data schema has the following columns:
> - id
> - *event_date*
> - source
> - value
> Here the partition key source matches the data column source, but the partition key date doesn't match any column in the data.
> Then we cannot load the dataset in Spark using parquetFile.
> It reports:
> {code}
> org.apache.spark.sql.AnalysisException: Ambiguous references to source: (source#2,List()),(source#5,List());
> ...
> {code}
> Currently, if partition columns overlap with data columns, the partition columns have to be a subset of the data columns.
> Jianshi
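
For illustration, the condition described in the issue can be checked without Spark. The Python sketch below parses Hive-style key=value path segments and reports which partition keys overlap the data columns. The function names partition_keys and classify_partitioning are hypothetical and are not part of Spark. The sketch only models the rule as stated in the description (overlapping partition columns must be a subset of the data columns in the affected versions); it is not how Spark implements partition discovery.

```python
def partition_keys(path):
    """Return partition keys from Hive-style key=value path segments, in order."""
    keys = []
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, _, _value = segment.partition("=")
            keys.append(key)
    return keys


def classify_partitioning(path, data_columns):
    """Describe how the partition keys of `path` relate to `data_columns`.

    Assumption: the rule modeled here is taken from the issue description,
    not from Spark's source code.
    """
    keys = partition_keys(path)
    overlapping = [k for k in keys if k in data_columns]
    missing = [k for k in keys if k not in data_columns]
    if not overlapping:
        status = "disjoint"          # no partition key is also a data column
    elif not missing:
        status = "subset"            # every partition key is also a data column
    else:
        status = "partial-overlap"   # the layout reported as failing in the issue
    return {"overlapping": overlapping, "missing": missing, "status": status}


# Layout and schema from the issue description.
result = classify_partitioning(
    "parquet/source=live/date=2015-03-18/",
    ["id", "event_date", "source", "value"],
)
print(result)
```

With the layout from the description, source is reported as overlapping and date as missing from the data columns, which is the partial-overlap case the issue says cannot be loaded.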