Hi,

I save Parquet files in a partitioned table, i.e. under a path like
/path/to/table/myfield=a/ .
However, I also kept the field "myfield" in the Parquet data itself. As a
result, when I query the field, I get this error:

df.select("myfield").show(10)
"Exception in thread "main" org.apache.spark.sql.AnalysisException:
Ambiguous references to myfield  (myfield#2,List()),(myfield#47,List());"
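
To illustrate what I think is happening, here is a minimal sketch in plain
Python. It is not Spark's actual code, and the "id" column is made up for the
example. The assumption is that the reader derives one "myfield" column from
the directory name and finds a second one inside the file, so the name ends
up in the schema twice:

```python
# Illustration only: plain Python, not Spark's implementation.
# Assumption: partition columns parsed from the path are appended to the
# columns stored in the file, without checking for name clashes.

def partition_columns(path):
    """Parse Hive-style key=value directory names out of a path."""
    cols = []
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            cols.append((key, value))
    return cols


def merged_column_names(file_columns, path):
    """Column names seen if partition columns are appended blindly."""
    return list(file_columns) + [key for key, _ in partition_columns(path)]


def duplicated(names):
    """Names that occur more than once, in first-seen order."""
    seen, dups = set(), []
    for name in names:
        if name in seen and name not in dups:
            dups.append(name)
        seen.add(name)
    return dups


# "id" is a hypothetical second column stored in the file.
names = merged_column_names(["id", "myfield"], "/path/to/table/myfield=a/")
print(duplicated(names))
```

Any name reported by duplicated() would be a candidate for the "Ambiguous
references" error above.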

Looking at the code, I could not find a way to explicitly specify which
column I want. DataFrame#columns returns only strings (the column names), so
it cannot tell the two apart. Even when loading the data with an explicit
schema (StructType), I'm not sure I can do it.

Do I have to make sure that my partition field does not exist in the
data before saving? Or is there a way to declare which column in the schema
I want to query?
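
For the first option, this is roughly the layout I have in mind, again as a
plain-Python sketch rather than Spark code. JSON lines stand in for Parquet,
and the function name, file name and sample rows are all hypothetical. The
partition value lives only in the directory name and is removed from each
row before writing:

```python
import json
import os
import tempfile
from collections import defaultdict


def write_partitioned(rows, root, partition_key):
    """Write rows under root/<key>=<value>/, dropping the key from each row."""
    groups = defaultdict(list)
    for row in rows:
        # Keep every field except the partition field itself.
        payload = {k: v for k, v in row.items() if k != partition_key}
        groups[row[partition_key]].append(payload)

    for value, payloads in groups.items():
        directory = os.path.join(root, f"{partition_key}={value}")
        os.makedirs(directory, exist_ok=True)
        # "part-00000.json" is an arbitrary file name for this sketch.
        with open(os.path.join(directory, "part-00000.json"), "w") as out:
            for payload in payloads:
                out.write(json.dumps(payload) + "\n")

    # Return the partition values that were written, sorted.
    return sorted(groups)


# Hypothetical sample data.
rows = [
    {"id": 1, "myfield": "a"},
    {"id": 2, "myfield": "b"},
    {"id": 3, "myfield": "a"},
]
root = tempfile.mkdtemp()
written = write_partitioned(rows, root, "myfield")
```

With this layout, "myfield" would only exist once, as the partition column
derived from the path.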

Also, for the same reason, if I try to persist() the data, I get this
error:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 3
        at parquet.bytes.BytesUtils.bytesToInt(BytesUtils.java:227)
        at parquet.column.statistics.IntStatistics.setMinMaxFromBytes(IntStatistics.java:46)
        at parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:249)
        at parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:558)
        at parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:492)
        at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:116)
        at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
