I face a similar issue in Spark 1.2. Caching the SchemaRDD takes about 50 s for 400 MB of data. The schema is similar to the TPC-H LineItem table.
Here is the code I use to cache the table. I am wondering if there is any setting I am missing? Thank you so much!

    lineitemSchemaRDD.registerTempTable("lineitem");
    sqlContext.sqlContext().cacheTable("lineitem");
    System.out.println(lineitemSchemaRDD.count());

On Mon, Apr 6, 2015 at 8:00 PM, Christian Perez <christ...@svds.com> wrote:
> Hi all,
>
> Has anyone else noticed very slow time to cache a Parquet file? It
> takes 14 s per 235 MB (1 block) uncompressed node-local Parquet file
> on M2 EC2 instances. Or are my expectations way off...
>
> Cheers,
>
> Christian
>
> --
> Christian Perez
> Silicon Valley Data Science
> Data Analyst
> christ...@svds.com
> @cp_phd
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org

--
Wenlei Xie (谢文磊)
Ph.D. Candidate
Department of Computer Science
456 Gates Hall, Cornell University
Ithaca, NY 14853, USA
Email: wenlei....@gmail.com
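P.S. For context, here is a sketch of the full pattern I have been experimenting with, including the in-memory columnar storage settings from the Spark SQL configuration docs. Note that `cacheTable` is lazy, so the first action pays the cost of building the columnar cache; the specific values below are illustrative assumptions, not recommendations, and defaults vary by Spark version.

```java
// Hedged sketch: Spark SQL in-memory columnar cache settings that may
// affect cache-build time (setting names per the Spark SQL docs; the
// values shown are illustrative only, not tuned recommendations).
sqlContext.sqlContext().setConf("spark.sql.inMemoryColumnarStorage.compressed", "true");
sqlContext.sqlContext().setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000");

// Register and mark the table for caching. cacheTable() is lazy:
// nothing is materialized at this point.
lineitemSchemaRDD.registerTempTable("lineitem");
sqlContext.sqlContext().cacheTable("lineitem");

// The first action scans the data and builds the in-memory columnar
// representation, so the measured time includes the cache build.
System.out.println(lineitemSchemaRDD.count());
```

A second action on the same table afterwards should read from the cache, so timing that separately would show whether the 50 s is the one-time cache build or the steady-state scan.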