Hi,

I save Parquet files in a partitioned table, i.e. under paths that look like /path/to/table/myfield=a/ . However, I also kept the field "myfield" in the Parquet data itself. As a result, when I query that field, I get this error:
df.select("myfield").show(10)

Exception in thread "main" org.apache.spark.sql.AnalysisException: Ambiguous references to myfield (myfield#2,List()),(myfield#47,List());

Looking at the code, I could not find a way to explicitly specify which of the two columns I want. DataFrame#columns returns only strings, and even when loading the data with an explicit schema (StructType), I'm not sure it can be done.

Do I have to make sure that my partition field does not exist in the data before saving? Or is there a way to declare which column in the schema I want to query?

Also, for the same reason, if I try to persist() the data, I get this error:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 3
    at parquet.bytes.BytesUtils.bytesToInt(BytesUtils.java:227)
    at parquet.column.statistics.IntStatistics.setMinMaxFromBytes(IntStatistics.java:46)
    at parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
    at parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:558)
    at parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:492)
    at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:116)
    at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
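
To make the first option concrete (making sure the partition field is not also in the data before saving), here is a minimal sketch in plain Python. It does not use Spark at all. It only parses the key=value segments of a partition path and reports which data columns collide with them. The helper names partition_columns and conflicting_columns, and the example column list, are hypothetical and chosen for illustration only.

```python
from pathlib import PurePosixPath


def partition_columns(path):
    """Return the key=value partition columns encoded in a directory path."""
    columns = {}
    for part in PurePosixPath(path).parts:
        key, sep, value = part.partition("=")
        # Only path segments of the form key=value count as partition columns.
        if sep and key:
            columns[key] = value
    return columns


def conflicting_columns(path, data_columns):
    """Return data column names that also appear as partition keys in the path."""
    keys = partition_columns(path)
    return [name for name in data_columns if name in keys]


# Hypothetical column list for the files stored under myfield=a/
data_columns = ["id", "myfield", "value"]
conflicts = conflicting_columns("/path/to/table/myfield=a/", data_columns)
print(conflicts)
```

Any name reported by such a check is a column that would show up twice when the table is read back, once from the directory name and once from the file contents, which is the situation that produces the ambiguous reference above.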