Hi,

We use Avro to store/accumulate/batch streaming data and then we migrate it
to Parquet.

We then use union queries to merge fresh and historical data (Avro +
Parquet).
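
For example, the merge can be a plain UNION ALL over the Avro and the Parquet
directories. A minimal sketch only (the dfs paths and column names below are
placeholders, not our actual layout):

    SELECT event_time, user_id, payload
    FROM dfs.`/data/events/avro`      -- fresh streaming data (Avro)
    UNION ALL
    SELECT event_time, user_id, payload
    FROM dfs.`/data/events/parquet`;  -- historical data (Parquet)

Both sides have to expose the same column layout for the UNION ALL to work,
which is one reason keeping the Avro and Parquet schemas compatible matters
(more on that below).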

Things to keep in mind (AFAIK):

   - Avro is considerably slower and less efficient than Parquet, both in
   storage space and in query performance
   - We migrate our Avro records to Parquet every 24 hours

   - The parquet-avro library (part of parquet-mr) will not create
   Drill-compatible Parquet files if you are using nested structures in Avro
   (in some cases)
   - Use Drill itself to convert your Avro files into Parquet (see the CTAS
   sketch after this list)

   - Avro lacks date support, and maintaining a compatible schema between
   Avro and Parquet can be a bit tricky (depending on the structure)

   - The Avro plugin for Drill does not support directory pruning
   - We rely on directory pruning to limit the files scanned by date-range
   queries (see the partitioned query sketch after this list)

   - We have been dealing with a lot of issues with Avro
   - We hope the remainder of them are fixed in the imminent 1.6 release of
   Drill

   - Parquet files are not suited for frequent updates (streaming inserts)

   - If you are getting strange query results, immediately suspect the Avro
   plugin
   - This will hopefully save you some time otherwise spent second-guessing
   or re-verifying your data
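
To make the conversion and pruning points concrete: the conversion is a CTAS
in Drill, and the date-range query filters on the implicit dir0/dir1/dir2
columns. A rough sketch only (workspace, paths and partition layout are
placeholders, adjust to your own setup):

    -- Write the accumulated Avro files out as Parquet
    ALTER SESSION SET `store.format` = 'parquet';

    CREATE TABLE dfs.tmp.`events_parquet/2016/03/08` AS
    SELECT * FROM dfs.`/data/events/avro/2016/03/08`;

    -- Date-range query on the Parquet side; the dir0/dir1/dir2 filters let
    -- Drill prune whole sub-directories instead of scanning every file
    SELECT *
    FROM dfs.tmp.`events_parquet`
    WHERE dir0 = '2016' AND dir1 = '03' AND dir2 = '08';

The dir0/dir1/dir2 columns map to the directory levels under the queried
path, so laying the Parquet files out by date is what makes the pruning
possible.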

Hope this helps.

Regards,
 -Stefán


On Tue, Mar 8, 2016 at 11:58 AM, Conrad Crampton <
conrad.cramp...@secdata.com> wrote:

> Hi (new here),
> I have a plan to use Drill to provide a sql abstraction layer (as an
> alternative to Hive). I like what I see so far, but I am a bit in the dark
> on Avro support. Whilst support for Avro is mentioned (almost in passing)
> in the documentation, there is very little detail on its use in practice
> as opposed to Parquet references. I am using Apache NiFi to move data
> around, with Avro data on HDFS as the final resting place (as NiFi supports this
> nicely out of the box). I therefore want to use Drill to query this, but
> the tests I have done so far seem very slow when querying any substantial
> amount of avro data directly with Drill.
>
> I am looking for some pointers on how best to do this – my idea was to
> have my data in avro (well defined schema), partitioned into HDFS
> directory/ sub directories but simple select * from `/location` limit 100
> takes forever (many minutes). Am I to assume that I need to create tables/
> views on top of the raw data for Drill to optimise its queries, and, if so,
> that Drill doesn't need to re-run these as batch jobs to update them?
>
> Any pointers/ documentation/ blog links would be welcome.
>
> Thanks
> Conrad
>
>
