Re: Parquet schema changes

2015-01-07 Thread Adam Gilmore
Fantastic - glad to see that it's in the pipeline! On Wed, Jan 7, 2015 at 11:27 AM, Michael Armbrust mich...@databricks.com wrote: I want to support this but we don't yet. Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-3851 On Tue, Jan 6, 2015 at 5:23 PM, Adam Gilmore
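For reference, once SPARK-3851 was resolved in a later Spark release, Parquet schema merging was exposed through the DataFrame reader. A minimal sketch, assuming a Spark 1.5+ shell where `sc` is the active SparkContext; paths and column names are placeholders:

```scala
import org.apache.spark.sql.SQLContext

// Sketch of Parquet schema merging as exposed after SPARK-3851 landed
// (later Spark releases). Paths and column names are placeholders.
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Two Parquet directories written with different but compatible schemas.
Seq((1, "a")).toDF("id", "name").write.parquet("/tmp/events/key=1")
Seq((2, "b", 3.5)).toDF("id", "name", "score").write.parquet("/tmp/events/key=2")

// mergeSchema reconciles the schemas; rows lacking "score" read back as null.
val merged = sqlContext.read.option("mergeSchema", "true").parquet("/tmp/events")
merged.printSchema()
```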

Re: Parquet schema changes

2015-01-06 Thread Adam Gilmore
for high performance, but there is the potential for new fields to be added to the JSON structure, so we want to be able to handle that every time we encode to Parquet (we'll be doing it incrementally for performance). On Mon, Jan 5, 2015 at 3:44 PM, Adam Gilmore dragoncu...@gmail.com wrote: I saw
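A minimal sketch of that workflow with the Spark 1.2-era API (jsonFile / saveAsParquetFile / insertInto); paths and the table name are placeholders:

```scala
import org.apache.spark.sql.SQLContext

// Sketch only: incremental JSON -> Parquet encoding with the Spark 1.2 API.
// Paths and the table name are placeholders.
val sqlContext = new SQLContext(sc)

// Infer a schema from the incoming JSON batch; newly added fields simply
// appear as extra columns in the inferred schema.
val firstBatch = sqlContext.jsonFile("/data/incoming/batch-001.json")

// First batch: write a fresh Parquet file and register it as a table.
firstBatch.saveAsParquetFile("/data/parquet/events")
sqlContext.parquetFile("/data/parquet/events").registerTempTable("events")

// Later batches: append into the existing Parquet table. In Spark 1.2 the
// schemas must match exactly, which is the limitation discussed in this
// thread (tracked by SPARK-3851).
val nextBatch = sqlContext.jsonFile("/data/incoming/batch-002.json")
nextBatch.insertInto("events")
```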

Re: Parquet predicate pushdown

2015-01-06 Thread Adam Gilmore
-programming-guide.html#configuration On Mon, Jan 5, 2015 at 3:38 PM, Adam Gilmore dragoncu...@gmail.com wrote: Hi all, I have a question regarding predicate pushdown for Parquet. My understanding was this would use the metadata in Parquet's blocks/pages to skip entire chunks that won't match
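For context, the relevant setting in Spark 1.2 is spark.sql.parquet.filterPushdown, which is disabled by default. A minimal sketch of enabling it from a shell, assuming `sc` is the active SparkContext:

```scala
import org.apache.spark.sql.SQLContext

// Parquet filter pushdown is off by default in Spark 1.2; enable it explicitly.
val sqlContext = new SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
```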

Parquet predicate pushdown

2015-01-05 Thread Adam Gilmore
Hi all, I have a question regarding predicate pushdown for Parquet. My understanding was this would use the metadata in Parquet's blocks/pages to skip entire chunks that won't match without needing to decode the values and filter on every value in the table. I was testing a scenario where I had
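To illustrate the scenario described: with pushdown enabled, Parquet row-group statistics (per-column min/max) let the reader skip row groups whose value range cannot satisfy the predicate. A hedged sketch with placeholder paths, table, and column names:

```scala
import org.apache.spark.sql.SQLContext

// Illustration only: a selective filter over a Parquet table. With
// spark.sql.parquet.filterPushdown enabled, row groups whose min/max
// statistics for "id" exclude 12345 are skipped rather than decoded.
val sqlContext = new SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

val events = sqlContext.parquetFile("/data/parquet/events")
events.registerTempTable("events")

sqlContext.sql("SELECT * FROM events WHERE id = 12345").collect()
```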

Re: Issue with Parquet on Spark 1.2 and Amazon EMR

2015-01-04 Thread Adam Gilmore
Just an update on this - I found that the script by Amazon was the culprit - not exactly sure why. When I installed Spark manually onto the EMR cluster (and did the manual configuration of all the EMR stuff), it worked fine. On Mon, Dec 22, 2014 at 11:37 AM, Adam Gilmore dragoncu...@gmail.com wrote

Re: Parquet schema changes

2015-01-04 Thread Adam Gilmore
) are identical. On 12/22/14 1:11 PM, Adam Gilmore wrote: Hi all, I understand that Parquet allows for schema versioning automatically in the format; however, I'm not sure whether Spark supports this. I'm saving a SchemaRDD to a Parquet file, registering it as a table, then doing

Issue with Parquet on Spark 1.2 and Amazon EMR

2014-12-21 Thread Adam Gilmore
Hi all, I've just launched a new Amazon EMR cluster and used the script at: s3://support.elasticmapreduce/spark/install-spark to install Spark (this script was upgraded to support 1.2). I know there are tools to launch a Spark cluster in EC2, but I want to use EMR. Everything installs fine;

Parquet schema changes

2014-12-21 Thread Adam Gilmore
Hi all, I understand that Parquet allows for schema versioning automatically in the format; however, I'm not sure whether Spark supports this. I'm saving a SchemaRDD to a Parquet file, registering it as a table, then doing an insertInto with a SchemaRDD with an extra column. The second
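A hedged reconstruction of the scenario described, using the Spark 1.2-era SchemaRDD API; paths, case classes, and the table name are placeholders:

```scala
import org.apache.spark.sql.SQLContext

// Sketch of the scenario: save a SchemaRDD as Parquet, register it, then
// insertInto with a SchemaRDD carrying an extra column. Names are placeholders.
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

case class EventV1(id: Int, name: String)
case class EventV2(id: Int, name: String, score: Double) // extra column

// Save the first SchemaRDD as a Parquet file and register it as a table.
sc.parallelize(Seq(EventV1(1, "a"))).saveAsParquetFile("/tmp/events")
sqlContext.parquetFile("/tmp/events").registerTempTable("events")

// Insert a SchemaRDD with an extra column; in Spark 1.2 the Parquet schemas
// must be identical, so this second insert is where the problem shows up.
sc.parallelize(Seq(EventV2(2, "b", 3.5))).insertInto("events")
```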