Parquet is a column-oriented format, so if you only need a subset of the columns, you read less data from the file system. Parquet also pushes selection predicates down to the storage layer, which can skip the needless deserialization of rows that don't match a filter. On top of that, you get compression, and you likely save processor cycles that would otherwise go to parsing lines from text files.
On Mon, Nov 24, 2014 at 8:20 AM, mrm <ma...@skimlinks.com> wrote:
> Hi,
>
> Is there any advantage to storing data in the Parquet format, loading it
> using the SparkSQL context, but never registering it as a table/using SQL
> on it? Something like:
>
> data = sqc.parquetFile(path)
> results = data.map(lambda x: applyfunc(x.field))
>
> Is this faster/more optimised than having the data stored as a text file
> and using Spark (non-SQL) to process it?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/advantages-of-SparkSQL-tp19661.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.