I want to support this but we don't yet. Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-3851
On Tue, Jan 6, 2015 at 5:23 PM, Adam Gilmore <dragoncu...@gmail.com> wrote:

> Anyone got any further thoughts on this? I saw that the _metadata file
> seems to store the schema of every single part (i.e. file) in the parquet
> directory, so in theory it should be possible.
>
> Effectively, our use case is that we have a stack of JSON that we receive
> and want to encode to Parquet for high performance, but there is the
> potential for new fields being added to the JSON structure, so we want to
> be able to handle that every time we encode to Parquet (we'll be doing it
> "incrementally" for performance).
>
> On Mon, Jan 5, 2015 at 3:44 PM, Adam Gilmore <dragoncu...@gmail.com> wrote:
>
>> I saw that in the source, which is why I was wondering.
>>
>> I was mainly reading:
>>
>> http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/
>>
>> "A query that tries to parse the organizationId and userId from the 2
>> logTypes should be able to do so correctly, though they are positioned
>> differently in the schema. With Parquet, it's not a problem. It will
>> merge 'A' and 'V' schemas and project columns accordingly. It does so by
>> maintaining a file schema in addition to merged schema and parsing the
>> columns by referencing the 2."
>>
>> I know that each part file can have its own schema, but I saw in the
>> Spark implementation that, if there is no metadata file, it just picks
>> the first file and uses that schema across the board. I'm not quite sure
>> how other implementations like Impala deal with this, but I was really
>> hoping there would be a way to "version" the schema as new records are
>> added and just project it through.
>>
>> It would be a godsend for semi-structured data.
>>
>> On Tue, Dec 23, 2014 at 3:33 PM, Cheng Lian <lian.cs....@gmail.com> wrote:
>>
>>> I must have missed something important here; could you please provide
>>> more detail on Parquet "schema versioning"? I wasn't aware of this
>>> feature (which sounds really useful).
>>>
>>> In particular, are you referring to the following scenario?
>>>
>>> 1. Write some data whose schema is A to "t.parquet", resulting in a
>>> file "t.parquet/parquet-r-1.part" on HDFS
>>> 2. Append more data whose schema B "contains" A but has more columns
>>> to "t.parquet", resulting in another file "t.parquet/parquet-r-2.part"
>>> on HDFS
>>> 3. Now read "t.parquet", and schemas A and B are expected to be merged
>>>
>>> If this is the case, then current Spark SQL doesn't support it. We
>>> assume that the schemas of all data within a single Parquet file (which
>>> is an HDFS directory with multiple part-files) are identical.
>>>
>>> On 12/22/14 1:11 PM, Adam Gilmore wrote:
>>>
>>> Hi all,
>>>
>>> I understand that Parquet allows for schema versioning automatically in
>>> the format; however, I'm not sure whether Spark supports this.
>>>
>>> I'm saving a SchemaRDD to a Parquet file, registering it as a table,
>>> then doing an insertInto with a SchemaRDD that has an extra column.
>>>
>>> The second SchemaRDD does in fact get inserted, but the extra column
>>> isn't present when I try to query it with Spark SQL.
>>>
>>> Is there anything I can do to get this working the way I'm hoping?
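
For concreteness, here is a minimal Scala sketch of the scenario being
discussed, written against the Spark 1.2-era API (jsonRDD, saveAsParquetFile,
registerTempTable, insertInto). The field names, table name, and path are
illustrative, not from the thread; the final read is where schema merging
would need to happen, and today it reports the schema taken from a single
part-file, so the extra column goes missing:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SchemaMergeRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("schema-merge-repro"))
    val sqlContext = new SQLContext(sc)

    // Schema A: two fields (hypothetical names, echoing the blog post).
    val batchA = sc.parallelize(Seq("""{"organizationId": 1, "userId": 10}"""))
    // Schema B: a superset of A, with one extra field.
    val batchB = sc.parallelize(
      Seq("""{"organizationId": 2, "userId": 20, "newField": "x"}"""))

    // Write the first batch; this creates a part-file (and _metadata)
    // under t.parquet on disk/HDFS.
    sqlContext.jsonRDD(batchA).saveAsParquetFile("t.parquet")

    // Append the second batch via insertInto, as in the original question.
    // The rows are written as a second part-file with schema B.
    sqlContext.parquetFile("t.parquet").registerTempTable("t")
    sqlContext.jsonRDD(batchB).insertInto("t")

    // Re-read the directory. Both part-files are there, but the schema
    // Spark SQL reports comes from a single file rather than a merge of
    // all footers, so "newField" is not visible in queries.
    val merged = sqlContext.parquetFile("t.parquet")
    merged.printSchema()
  }
}
```

Once SPARK-3851 lands, the expectation would be that the final printSchema()
shows the union of schemas A and B, with "newField" null for rows from the
first batch.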