Parquet is faster for ad hoc queries because of its columnar storage: it
reads only the columns a query needs. In these slides from SVDS it is more
than twice as fast as Avro (often a lot more):
http://www.slideshare.net/StampedeCon/choosing-an-hdfs-data-storage-format-avro-vs-parquet-and-more-stampedecon-2015
See slides 25-32.
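
To make the column-pruning point concrete, here is a toy sketch in plain Python. It is not how Parquet is actually implemented (real files add row groups, encodings, and compression); the sample `rows` and the helper names are made up for illustration. It just counts how many values each layout has to touch to sum one column.

```python
# Toy model only: illustrates row vs. column layout, not the Parquet format itself.
rows = [
    {"id": 1, "name": "a", "price": 10.0, "qty": 3},
    {"id": 2, "name": "b", "price": 20.0, "qty": 1},
    {"id": 3, "name": "c", "price": 30.0, "qty": 2},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def sum_row_layout(records, field):
    """Row layout: the whole record is read even though one field is needed."""
    total = 0
    touched = 0
    for record in records:
        touched += len(record)  # every field of the record is deserialized
        total += record[field]
    return total, touched


def sum_column_layout(columns, field):
    """Column layout: only the requested column is read."""
    values = columns[field]
    return sum(values), len(values)


columns = to_columns(rows)
row_total, row_touched = sum_row_layout(rows, "price")
col_total, col_touched = sum_column_layout(columns, "price")
print(row_total, row_touched)
print(col_total, col_touched)
```

Both layouts give the same answer, but the row layout touches every field of every record while the column layout touches only the `price` values. The gap grows with the width of the table.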

It's fast on Impala, too. In fact, it was designed with Impala in mind.
Hive is also supported.

Like ORC, it is self-describing: the schema is stored in the file.

Avro is very flexible for schema evolution. It allows adding, renaming, and
deleting columns. Parquet only supports adding columns, so that's a
tradeoff. Also, Parquet is optimized for reads but is slower on writes (some
performance numbers are in the slides above).
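
As a rough illustration of what "adding columns" means for readers, here is a small plain-Python sketch. It is a toy model, not Parquet or Avro code; `read_with_schema`, the sample records, and the `country` column are hypothetical. Records written before a column existed are read back with that column set to None.

```python
# Toy model only: shows add-only schema evolution, not a real file reader.
def read_with_schema(records, schema):
    """Project each record onto `schema`, filling absent columns with None."""
    return [{col: record.get(col) for col in schema} for record in records]


# Written before the "country" column was added.
old_file = [{"id": 1, "name": "a"}]
# Written after the "country" column was added.
new_file = [{"id": 2, "name": "b", "country": "US"}]

# The newer schema is a superset of the older one.
merged_schema = ["id", "name", "country"]
table = read_with_schema(old_file + new_file, merged_schema)
print(table)
```

In Spark, the Parquet reader's `mergeSchema` option covers a similar case when files with compatible schemas sit side by side.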

Xinh

On Thu, Mar 3, 2016 at 8:48 PM, Jong Wook Kim <ilike...@gmail.com> wrote:

> How about ORC? I have experimented briefly with Parquet and ORC, and I
> liked the fact that ORC has its schema within the file, which makes it
> handy to work with any other tools.
>
> Jong Wook
>
> On 3 March 2016 at 23:29, Don Drake <dondr...@gmail.com> wrote:
>
>> My tests show Parquet has better performance than Avro in just about
>> every test.  It really shines when you are querying a subset of columns in
>> a wide table.
>>
>> -Don
>>
>> On Wed, Mar 2, 2016 at 3:49 PM, Timothy Spann <tim.sp...@airisdata.com>
>> wrote:
>>
>>> Which format is the best format for SparkSQL ad hoc queries and general
>>> data storage?
>>>
>>> There are lots of specialized cases, but generally we are accessing
>>> some, but not all, of the available columns over a reasonable subset of
>>> the data.
>>>
>>> I am leaning towards Parquet as it has great support in Spark.
>>>
>>> I also have to consider that any file on HDFS may be accessed from other
>>> tools like Hive, Impala, HAWQ.
>>>
>>> Suggestions?
>>> —
>>> airis.DATA
>>> Timothy Spann, Senior Solutions Architect
>>> C: 609-250-5894
>>> http://airisdata.com/
>>> http://meetup.com/nj-datascience
>>>
>>>
>>>
>>
>>
>> --
>> Donald Drake
>> Drake Consulting
>> http://www.drakeconsulting.com/
>> https://twitter.com/dondrake <http://www.MailLaunder.com/>
>> 800-733-2143
>>
>
>
