Parquet is faster for ad hoc queries because of its columnar storage: it reads only the columns a query needs. In these slides from SVDS it is more than twice as fast as Avro, and often much faster than that: http://www.slideshare.net/StampedeCon/choosing-an-hdfs-data-storage-format-avro-vs-parquet-and-more-stampedecon-2015 (slides 25-32).
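
To make the column-pruning point above concrete, here is a toy sketch in plain Python. It is not Parquet's or Avro's actual on-disk format: the sample table, the JSON encoding, and all function names are made up for illustration. It only shows why summing one column touches fewer bytes when values are grouped by column rather than by record.

```python
import json

# Hypothetical sample table: 3 rows, 4 columns.
ROWS = [
    {"id": 1, "name": "alice", "city": "Austin", "amount": 10.5},
    {"id": 2, "name": "bob", "city": "Boston", "amount": 20.0},
    {"id": 3, "name": "carol", "city": "Chicago", "amount": 7.25},
]


def encode_row_oriented(rows):
    """Row layout: one encoded blob per record, all fields together."""
    return [json.dumps(row).encode("utf-8") for row in rows]


def encode_column_oriented(rows):
    """Columnar layout: one encoded blob per column, all values together."""
    columns = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        columns[name] = json.dumps(values).encode("utf-8")
    return columns


def sum_from_rows(blobs, column):
    """Every record is read and decoded in full to extract one field."""
    bytes_read = 0
    total = 0.0
    for blob in blobs:
        bytes_read += len(blob)
        total += json.loads(blob)[column]
    return total, bytes_read


def sum_from_columns(columns, column):
    """Only the blob for the requested column is read and decoded."""
    blob = columns[column]
    return sum(json.loads(blob)), len(blob)


row_total, row_bytes = sum_from_rows(encode_row_oriented(ROWS), "amount")
col_total, col_bytes = sum_from_columns(encode_column_oriented(ROWS), "amount")

print(f"row layout:      sum={row_total}, bytes read={row_bytes}")
print(f"columnar layout: sum={col_total}, bytes read={col_bytes}")
```

Both layouts give the same sum, but the columnar one reads only the bytes of the "amount" column. The gap grows with the number of columns the query does not need, which matches the wide-table case discussed in this thread.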
It's fast on Impala, too. In fact, it was designed with Impala in mind; Hive is also supported. Like ORC, it is self-describing: the schema is stored in the file.

Avro is very flexible for schema evolution: it allows adding, renaming, and deleting columns. Parquet only supports adding columns, so that's a tradeoff. Also, Parquet is optimized for reads but is slower on writes (there are some performance numbers in the slides above).

Xinh

On Thu, Mar 3, 2016 at 8:48 PM, Jong Wook Kim <ilike...@gmail.com> wrote:

> How about ORC? I have experimented briefly with Parquet and ORC, and I
> liked the fact that ORC has its schema within the file, which makes it
> handy to work with any other tools.
>
> Jong Wook
>
> On 3 March 2016 at 23:29, Don Drake <dondr...@gmail.com> wrote:
>
>> My tests show Parquet has better performance than Avro in just about
>> every test. It really shines when you are querying a subset of columns in
>> a wide table.
>>
>> -Don
>>
>> On Wed, Mar 2, 2016 at 3:49 PM, Timothy Spann <tim.sp...@airisdata.com>
>> wrote:
>>
>>> Which format is the best format for SparkSQL ad hoc queries and general
>>> data storage?
>>>
>>> There are lots of specialized cases, but generally we are accessing some
>>> but not all of the available columns, with a reasonable subset of the data.
>>>
>>> I am leaning towards Parquet as it has great support in Spark.
>>>
>>> I also have to consider that any file on HDFS may be accessed from other
>>> tools like Hive, Impala, HAWQ.
>>>
>>> Suggestions?
>>> --
>>> airis.DATA
>>> Timothy Spann, Senior Solutions Architect
>>> C: 609-250-5894
>>> http://airisdata.com/
>>> http://meetup.com/nj-datascience
>>
>> --
>> Donald Drake
>> Drake Consulting
>> http://www.drakeconsulting.com/
>> https://twitter.com/dondrake
>> 800-733-2143