Re: Parquet schema changes

2015-01-07 Thread Adam Gilmore
Fantastic - glad to see that it's in the pipeline!

On Wed, Jan 7, 2015 at 11:27 AM, Michael Armbrust mich...@databricks.com wrote:
> I want to support this, but we don't yet. Here is the JIRA:
> https://issues.apache.org/jira/browse/SPARK-3851
> On Tue, Jan 6, 2015 at 5:23 PM, Adam Gilmore

Re: Parquet schema changes

2015-01-06 Thread Adam Gilmore
Anyone got any further thoughts on this? I saw that the _metadata file seems to store the schema of every single part (i.e. file) in the Parquet directory, so in theory it should be possible. Effectively, our use case is that we have a stack of JSON that we receive and we want to encode to Parquet
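The per-part schema merging being described can be sketched in plain Python (no Spark involved; the column names, the types-as-strings, and the merge_schemas helper are all hypothetical illustrations of the idea, not Spark's actual implementation). The key point is that a column missing from an older part is not an error; it just becomes nullable, while a genuine type conflict is:

```python
# Illustrative sketch: merge the schemas of individual Parquet parts,
# the way a _metadata summary file conceptually would. Field names are
# the keys; a field absent from an older part simply becomes nullable.
def merge_schemas(part_schemas):
    merged = {}
    for schema in part_schemas:
        for name, dtype in schema.items():
            if name in merged and merged[name] != dtype:
                # Same column name with two different types cannot be merged.
                raise ValueError(f"type conflict for column {name!r}")
            merged[name] = dtype
    return merged

# An older part without the new column, and a newer part with it.
old_part = {"id": "int64", "name": "string"}
new_part = {"id": "int64", "name": "string", "score": "double"}
print(merge_schemas([old_part, new_part]))
```

The merged result carries all three columns; readers of the old part would surface `score` as null.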

Re: Parquet schema changes

2015-01-06 Thread Michael Armbrust
I want to support this, but we don't yet. Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-3851

On Tue, Jan 6, 2015 at 5:23 PM, Adam Gilmore dragoncu...@gmail.com wrote:
> Anyone got any further thoughts on this? I saw that the _metadata file seems to store the schema of every single

Re: Parquet schema changes

2015-01-04 Thread Adam Gilmore
I saw that in the source, which is why I was wondering. I was mainly reading: http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/

> A query that tries to parse the organizationId and userId from the 2 logTypes should be able to do so correctly, though they are positioned differently
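The blog's point is that Parquet resolves columns by name rather than by position, so the same projection works across files whose columns are ordered differently. A minimal Spark-free sketch of that behavior (the records and the project helper are invented for illustration):

```python
# Illustrative: name-based column projection, as Parquet does it.
# The same query works even though the two "log types" place
# organizationId and userId at different positions.
def project(record, columns):
    return tuple(record[c] for c in columns)

log_type_a = {"organizationId": "org1", "userId": "u1", "extra": 1}
log_type_b = {"timestamp": 123, "userId": "u2", "organizationId": "org2"}

for rec in (log_type_a, log_type_b):
    print(project(rec, ["organizationId", "userId"]))
```

A positional reader, by contrast, would misread `log_type_b`, since its first field is `timestamp`.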

Parquet schema changes

2014-12-21 Thread Adam Gilmore
Hi all,

I understand that Parquet allows for schema versioning automatically in the format; however, I'm not sure whether Spark supports this. I'm saving a SchemaRDD to a Parquet file, registering it as a table, then doing an insertInto with a SchemaRDD that has an extra column. The second
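What the insertInto scenario would need can be sketched abstractly: appending rows that carry an extra column means widening the table schema and back-filling nulls for rows written under the old schema. The following Spark-free illustration is hypothetical (append_with_evolution and every name in it are invented for this sketch; Spark did not support this at the time, per SPARK-3851):

```python
# Illustrative sketch of the schema-evolution behavior being asked for:
# appending records with an extra column widens the table schema and
# back-fills None for rows that predate the new column.
def append_with_evolution(table_schema, rows_schema, old_rows, new_rows):
    widened = dict(table_schema)
    for name, dtype in rows_schema.items():
        widened.setdefault(name, dtype)

    def pad(row):
        # Every row is reshaped to the widened schema; missing -> None.
        return {name: row.get(name) for name in widened}

    return widened, [pad(r) for r in old_rows + new_rows]

schema, rows = append_with_evolution(
    {"id": "int64"},
    {"id": "int64", "score": "double"},
    old_rows=[{"id": 1}],
    new_rows=[{"id": 2, "score": 0.5}],
)
print(schema)
print(rows)
```

Without this widening step, an insert whose schema differs from the table's is the case that fails.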