Hi,

We are using Avro, JSON and Parquet for collecting various types of data
for analytical processing.

I had not used Parquet before we started playing around with Drill, and
now I'm wondering whether we are planning our data structures correctly and
whether we will be able to get the most out of Drill+Parquet.

I have some questions and I hope the answers can be turned into a Best
Practices document.

So here we go:

   - Are there any rules that we must abide by to make scanning of
   "low-cardinality" columns as effective as possible? (See the example
   query below.)
   - My understanding is that the Parquet dictionary is scanned for the
   value(s), and if they are not in the dictionary, that section (row group)
   is skipped
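
For concreteness, a minimal sketch of the kind of query I mean (the table
path and the low-cardinality column log_level are made up):

    SELECT event_ts, message
    FROM dfs.`/logs/2016-01-01`
    WHERE log_level = 'ERROR';  -- only a handful of distinct values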

   - Can dictionary-based scanning (as described above) work on arrays
   (e.g., with a query like the one sketched below)?
   - like: {"some":"simple","tags":["blue","green","yellow"]}
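
A minimal sketch of what I'd like to do, using Drill's repeated_contains()
on the tags array from the JSON above (the table path is hypothetical):

    SELECT *
    FROM dfs.`/events`
    WHERE repeated_contains(tags, 'blue');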

   - If I have multiple files containing a day's worth of logging, in
   chronological order, will all the irrelevant files be skipped when looking
   for a date or a date range?
   - AKA - will the min/max statistics in the Parquet footers be used to
   prevent scanning of data outside the range? (See the sketch below.)
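
A sketch of the kind of date-range query I have in mind (path and column
names are made up); ideally only files whose min/max statistics overlap the
range would be scanned:

    SELECT *
    FROM dfs.`/logs/2016-01`
    WHERE event_ts BETWEEN TIMESTAMP '2016-01-10 00:00:00'
                       AND TIMESTAMP '2016-01-11 00:00:00';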

   - Is there anything I need to do to make sure that the write
   optimizations in Parquet are used? (What I've found so far is sketched
   below.)
   - dictionary encoding for low-cardinality fields
   - "number folding" (delta encoding?) for numerical sequences
   - compression etc.
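
For writes done through Drill's CTAS, I'm assuming these session options
are what control it (the values here are just examples, not
recommendations):

    ALTER SESSION SET `store.parquet.compression` = 'snappy';
    ALTER SESSION SET `store.parquet.enable_dictionary_encoding` = true;
    ALTER SESSION SET `store.parquet.block-size` = 536870912;  -- 512 MB row groups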

   - Are there any Parquet features that are not available in Drill?
   - I know Drill is using a fork of Parquet and I wonder if any major
   improvements in Parquet are unavailable

   - Storing dates with timezone information (stored in two separate
   fields?)
   - What is the common approach? (One idea is sketched below.)
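
One layout I'm considering: a UTC timestamp plus a separate zone/offset
column (all table and column names below are hypothetical):

    CREATE TABLE dfs.tmp.`events_tz` AS
    SELECT CAST(event_utc AS TIMESTAMP) AS event_ts_utc,
           tz_name AS event_tz
    FROM dfs.`/staging/events`;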

   - Are there any caveats in converting Avro to Parquet?
   - other than converting Unix dates from Avro (where only long is
   available) into timestamp fields in Parquet (see the sketch below)
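
For that date conversion, I'm assuming a Drill CTAS like this would work
(paths and column names are made up; TO_TIMESTAMP takes epoch seconds):

    CREATE TABLE dfs.tmp.`logs_parquet` AS
    SELECT TO_TIMESTAMP(epoch_seconds) AS event_ts,
           message
    FROM dfs.`/staging/logs.avro`;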


There will, in all likelihood, be future installments to this thread as
new questions arise.

All help is appreciated.

Regards,
 -Stefan
