Hi,

We are using Avro, JSON, and Parquet for collecting various types of data for analytical processing.
I had not used Parquet before we started playing around with Drill, and now I'm wondering whether we are planning our data structures correctly and whether we will be able to get the most out of Drill+Parquet. I have some questions, and I hope the answers can be turned into a Best Practices document. So here we go:

- Are there any rules we must abide by to make scanning of "low-cardinality" columns as effective as it can be?
  - My understanding is that the Parquet dictionary is scanned for the value(s), and if they are not in the dictionary, that section is skipped.
- Can dictionary-based scanning (as described above) work on arrays?
  - For example: {"some":"simple","tags":["blue","green","yellow"]}
- If I have multiple files, each containing a day's worth of logging in chronological order, will all the irrelevant files be ignored when looking for a date or a date range?
  - In other words: will the min/max statistics in the Parquet headers be used to prevent scanning of data outside the range?
- Is there anything I need to do to make sure the write optimizations in Parquet are used?
  - dictionaries for low-cardinality fields
  - "number folding" for numerical sequences
  - compression, etc.
- Are there any Parquet features that are not available in Drill?
  - I know Drill is using a fork of Parquet, and I wonder if any major improvements in Parquet are unavailable.
- What is the common approach for storing dates with timezone information (stored in two separate fields?)
- Are there any caveats in converting Avro to Parquet?
  - other than converting Unix dates from Avro (only long is available) into timestamp fields in Parquet

There will, in all likelihood, be future installments to this entry as new questions arise. All help is appreciated.

Regards,
-Stefan
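To make the two date questions above concrete, here is roughly what I have in mind (a sketch in plain Python with no Parquet libraries; the function and field names are my own invention, not anything from Drill or Parquet):

```python
from datetime import datetime, timezone

# 1) Avro only gives me a long (Unix epoch milliseconds); for Parquet I would
#    convert it into a proper timestamp. Hypothetical helper:
def epoch_millis_to_timestamp(millis):
    """Convert an Avro long (epoch milliseconds) to an aware UTC datetime."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)

# 2) For the timezone question, one idea is two separate fields: the instant
#    normalized to UTC, plus the original zone name as a string.
def split_local_event(millis, zone_name):
    """Keep the instant in UTC and the source timezone side by side."""
    return {
        "event_time_utc": epoch_millis_to_timestamp(millis),
        "event_tz": zone_name,  # e.g. "Europe/Stockholm"
    }

row = split_local_event(1_400_000_000_000, "Europe/Stockholm")
print(row["event_time_utc"].isoformat())  # 2014-05-13T16:53:20+00:00
```

Whether that two-field layout is actually the common approach is exactly what I am asking.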