Issue with Parquet on Spark 1.2 and Amazon EMR

2014-12-21 Thread Adam Gilmore
Hi all, I've just launched a new Amazon EMR cluster and used the script at: s3://support.elasticmapreduce/spark/install-spark to install Spark (this script was upgraded to support 1.2). I know there are tools to launch a Spark cluster in EC2, but I want to use EMR. Everything installs fine; ho

Parquet schema changes

2014-12-21 Thread Adam Gilmore
Hi all, I understand that Parquet allows for schema versioning automatically in the format; however, I'm not sure whether Spark supports this. I'm saving a SchemaRDD to a Parquet file, registering it as a table, then doing an insertInto with a SchemaRDD with an extra column. The second SchemaRDD
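
A minimal sketch of the setup described above, assuming Spark 1.2's SchemaRDD API in a spark-shell session (with sc in scope); the case classes, path and table name are illustrative, not taken from the thread:

    // Sketch of the scenario above (Spark 1.2); names and paths are illustrative.
    import org.apache.spark.sql.SQLContext

    case class EventV1(id: Int, name: String)
    case class EventV2(id: Int, name: String, category: String)  // same fields plus an extra column

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD conversion

    // Save the first SchemaRDD to a Parquet file and register it as a table.
    sc.parallelize(Seq(EventV1(1, "a"))).saveAsParquetFile("/data/events.parquet")
    sqlContext.parquetFile("/data/events.parquet").registerTempTable("events")

    // Insert a second SchemaRDD that carries the extra column into the same table;
    // the question in this thread is whether the Parquet table's schema can evolve to include it.
    sc.parallelize(Seq(EventV2(2, "b", "extra"))).insertInto("events")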

Re: Parquet schema changes

2015-01-04 Thread Adam Gilmore
single Parquet file (which is an HDFS directory with multiple part-files) are identical. > On 12/22/14 1:11 PM, Adam Gilmore wrote: > Hi all, I understand that Parquet allows for schema versioning automatically in the format; however, I'm not sure whether S

Re: Issue with Parquet on Spark 1.2 and Amazon EMR

2015-01-04 Thread Adam Gilmore
Just an update on this - I found that the script by Amazon was the culprit, though I'm not exactly sure why. When I installed Spark manually onto the EMR cluster (and did the manual configuration of all the EMR stuff), it worked fine. On Mon, Dec 22, 2014 at 11:37 AM, Adam Gilmore wrote: > Hi all,

Parquet predicate pushdown

2015-01-05 Thread Adam Gilmore
Hi all, I have a question regarding predicate pushdown for Parquet. My understanding was that this would use the metadata in Parquet's blocks/pages to skip entire chunks that won't match, without needing to decode and filter every value in the table. I was testing a scenario where I had
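
An illustrative sketch of the kind of query this is about, under the same Spark 1.2 assumptions; the table, path and predicate are made up for the example:

    // Illustrative query for testing Parquet predicate pushdown (Spark 1.2 SQL).
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    sqlContext.parquetFile("/data/events.parquet").registerTempTable("events")

    // With pushdown working, Parquet's per-block/page min-max statistics for `id` let the
    // reader skip chunks whose value range cannot match, instead of decoding every value.
    val matched = sqlContext.sql("SELECT * FROM events WHERE id = 12345")
    matched.collect()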

Re: Parquet predicate pushdown

2015-01-06 Thread Adam Gilmore
s/latest/sql-programming-guide.html#configuration > On Mon, Jan 5, 2015 at 3:38 PM, Adam Gilmore wrote: >> Hi all, >> I have a question regarding predicate pushdown for Parquet. >> My understanding was that this would use the metadata in Parquet's
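
A sketch of enabling the setting in question, assuming the option the linked configuration section covers is spark.sql.parquet.filterPushdown, which was disabled by default in that release:

    // Enable Parquet filter pushdown for the current SQLContext (Spark 1.2).
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

    // It can also be passed at submit time, e.g.:
    //   spark-submit --conf spark.sql.parquet.filterPushdown=true ...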

Re: Parquet schema changes

2015-01-06 Thread Adam Gilmore
for high performance, but there is the potential for new fields to be added to the JSON structure, so we want to be able to handle that every time we encode to Parquet (we'll be doing it "incrementally" for performance). On Mon, Jan 5, 2015 at 3:44 PM, Adam Gilmore wrote: > I saw
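
A rough sketch of the incremental JSON-to-Parquet flow described above, again using the Spark 1.2 API; the paths and table name are illustrative:

    // Incrementally encoding JSON batches to Parquet (Spark 1.2); paths are illustrative.
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Each incoming JSON batch may carry fields the existing Parquet data does not have.
    val batch = sqlContext.jsonFile("/incoming/batch-0001.json")

    // Appending with insertInto only works while the inferred schema matches the table's
    // existing Parquet schema; handling newly added fields is the schema-merging question
    // raised in this thread.
    sqlContext.parquetFile("/data/events.parquet").registerTempTable("events")
    batch.insertInto("events")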

Re: Parquet schema changes

2015-01-07 Thread Adam Gilmore
Fantastic - glad to see that it's in the pipeline! On Wed, Jan 7, 2015 at 11:27 AM, Michael Armbrust wrote: > I want to support this but we don't yet. Here is the JIRA: > https://issues.apache.org/jira/browse/SPARK-3851 > On Tue, Jan 6, 2015 at 5:23 PM, Adam Gilmore