Hi again,

Are incremental timestamp values (long) being encoded in Parquet as
deltas?
(This is the option in Parquet to refrain from storing complete numbers and
store only the delta between consecutive numbers, to save space.)

Regards,
 -Stefan

On Mon, Nov 2, 2015 at 5:54 PM, Stefán Baxter <ste...@activitystream.com>
wrote:

> Hi Aman,
>
> Thank you for this information.
>
> I'm not sure I understand this correctly but...
>
>    1. Preventing scans where values are out of range is a core feature of
>    Parquet
>    - is this really not used/supported in Drill?
>
>    2. Who can tell whether dictionary encoding ...
>    A) works as expected?
>    B) is used during scanning to prevent pointless searches?
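To make (B) concrete, here is a toy sketch of what dictionary-based chunk skipping would mean (assumed behavior for illustration, not Drill's actual code): if the searched value is absent from a column chunk's dictionary, the whole chunk can be skipped without decoding its data pages.

```python
# Toy model of dictionary-based chunk skipping: each "chunk" carries the
# set of distinct values it contains (its dictionary). A reader looking
# for one value only decodes chunks whose dictionary contains it.

chunks = [
    {"dictionary": {"blue", "green"}, "rows": ["blue", "green", "blue"]},
    {"dictionary": {"red", "yellow"}, "rows": ["red", "yellow", "red"]},
]

def scan(chunks, wanted):
    hits = []
    for chunk in chunks:
        if wanted not in chunk["dictionary"]:
            continue  # skip the chunk entirely; no row-level scan needed
        hits.extend(row for row in chunk["rows"] if row == wanted)
    return hits

assert scan(chunks, "red") == ["red", "red"]
```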
>
> New questions that have emerged:
>
>    - How efficient is Drill's use of Parquet compared to Presto's?
>    - Can other solutions use Parquet files created by Drill when
>    partition pruning is used? (footer information)
>    - What is the status of the Parquet fork and getting Drill "on track"
>    in that regard?
>
> Regards,
>  -Stefan
>
> On Mon, Nov 2, 2015 at 4:44 PM, Aman Sinha <amansi...@apache.org> wrote:
>
>> > >    - If I have multiple files containing a day's worth of logging, in
>> > >    chronological order, will all the irrelevant files be ignored when
>> > >    looking for a date or a date range?
>> > >    - AKA - Will the min-max headers in Parquet be used to prevent
>> > >    scanning of data outside the range?
>>
>> If these files were created using CTAS auto-partitioning, Drill will apply
>> partition pruning to eliminate files that are not needed.  If they were
>> not created this way, the files are currently not eliminated; this is
>> something that should be addressed by DRILL-1950, which has been on the
>> list of important JIRAs and will hopefully be addressed in one of the
>> upcoming releases.
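The pruning being asked about can be sketched like this (a toy model of the general min/max idea, not Drill's planner): each file's footer records per-column min and max values, and any file whose range does not intersect the query range is never opened.

```python
# Toy model of min/max file pruning: each file records the min and max of
# a column in its footer; files whose [min, max] range does not overlap
# the query's range are skipped without being scanned.

files = [
    {"name": "day1.parquet", "min": 100, "max": 199},
    {"name": "day2.parquet", "min": 200, "max": 299},
    {"name": "day3.parquet", "min": 300, "max": 399},
]

def prune(files, lo, hi):
    """Keep only files whose value range overlaps the query range [lo, hi]."""
    return [f["name"] for f in files if f["min"] <= hi and f["max"] >= lo]

assert prune(files, 250, 320) == ["day2.parquet", "day3.parquet"]
```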
>>
>> > >    - Is there anything I need to do to make sure that the write
>> > >    optimizations in Parquet are used?
>> > >    - dictionaries for low cardinality fields
>>
>> Note that the default value of store.parquet.enable_dictionary_encoding
>> is currently false.  There were some issues with dictionary encoding in
>> the past with certain data types such as date; I thought they were
>> fixed.  We should discuss on the Drill dev list whether this can be
>> enabled (there is a substantial testing effort needed to make sure the
>> encoding works for all data types).
>>
>> Aman
>>
>>
>>
>>
>> On Sun, Nov 1, 2015 at 8:25 AM, John Omernik <j...@omernik.com> wrote:
>>
>> > I've read through your post and have had similar thoughts on trying to
>> > gather information around Parquet files.  I feel like it would be really
>> > helpful to have a section of the Drill User Docs dedicated to user
>> > stories on Parquet files.  I know stories sound odd to put into
>> > documentation, but I think the challenge of explaining the optimization
>> > of something like Parquet is that you can either do it from a dry
>> > academic point of view, which can be hard for the user base to really
>> > understand, or you can try to provide lots of stories that could be
>> > annotated by devs or improved with links to other stories.
>> >
>> > What I mean by stories is example data, how it is queried, and why it
>> > was stored that way (with partitions based on directories, options for
>> > "loading" data into directories, using partitions within the files, how
>> > Parquet optimizes, so folks know where to put extra effort into typing,
>> > etc.)
>> >
>> > As to your specific questions, I can't answer them myself; I've wondered
>> > about some of them too, but haven't gotten around to asking.  My
>> > experiences with Parquet have been generally positive, but have involved
>> > a good amount of trial and error (as you can see from some of my user
>> > posts).  (Also, the user group has been great, but, to my point about
>> > user stories, my education has come from posting stories and getting
>> > feedback from the community.  It would be neat to see this as a
>> > first-class part of the documentation, as I think it could help folks
>> > with Parquet, Drill, and optimizing their environments.)
>> >
>> > Wish I could be of more help beyond +1 :)
>> >
>> >
>> >
>> > On Sun, Nov 1, 2015 at 1:48 AM, Stefán Baxter <
>> ste...@activitystream.com>
>> > wrote:
>> >
>> > > So we are off to a flying start :)
>> > >
>> > > On Thu, Oct 29, 2015 at 9:50 PM, Stefán Baxter <
>> > ste...@activitystream.com>
>> > > wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > We are using Avro, JSON and Parquet for collecting various types of
>> > > > data for analytical processing.
>> > > >
>> > > > I had not used Parquet before we started playing around with Drill,
>> > > > and now I'm wondering whether we are planning our data structures
>> > > > correctly and whether we will be able to get the most out of
>> > > > Drill+Parquet.
>> > > >
>> > > > I have some questions and I hope the answers can be turned into a
>> > > > Best Practices document.
>> > > >
>> > > > So here we go:
>> > > >
>> > > >    - Are there any rules that we must abide by to make scanning of
>> > > >    "low-cardinality" columns as effective as possible?
>> > > >    - My understanding is that the Parquet dictionary is scanned for
>> > > >    the value(s), and if they are not in the dictionary, that section
>> > > >    is ignored.
>> > > >
>> > > >    - Can dictionary-based scanning (as described above) work on
>> > > >    arrays?
>> > > >    - like: {"some":"simple","tags":["blue","green","yellow"]}
>> > > >
>> > > >    - If I have multiple files containing a day's worth of logging,
>> > > >    in chronological order, will all the irrelevant files be ignored
>> > > >    when looking for a date or a date range?
>> > > >    - AKA - Will the min-max headers in Parquet be used to prevent
>> > > >    scanning of data outside the range?
>> > > >
>> > > >    - Is there anything I need to do to make sure that the write
>> > > >    optimizations in Parquet are used?
>> > > >    - dictionaries for low cardinality fields
>> > > >    - "number folding" for numerical sequences
>> > > >    - compression etc.
>> > > >
>> > > >    - Are there any Parquet features that are not available in Drill?
>> > > >    - I know Drill is using a fork of Parquet and I wonder if any
>> > > >    major improvements in Parquet are unavailable.
>> > > >
>> > > >    - Storing dates with timezone information (stored in two
>> > > >    separate fields?)
>> > > >    - What is the common approach?
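One approach to the two-field question (an illustration of the idea, not a Drill recommendation) is to store the instant as a UTC value in one column and the original UTC offset in a second column, so the instant sorts and prunes correctly while the local wall-clock time stays recoverable:

```python
# Sketch: represent a zoned timestamp as two columns, a UTC epoch-millis
# long plus the original UTC offset in minutes. The column and function
# names here are made up for illustration.

def to_columns(local_epoch_ms, offset_minutes):
    """Split a local-wall-clock epoch value into (utc_epoch_ms, offset_minutes)."""
    return local_epoch_ms - offset_minutes * 60_000, offset_minutes

def to_local(utc_epoch_ms, offset_minutes):
    """Reconstruct the local wall-clock epoch value."""
    return utc_epoch_ms + offset_minutes * 60_000

utc_ms, offset = to_columns(1446480840000, 60)   # a local time at UTC+01:00
assert to_local(utc_ms, offset) == 1446480840000
```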
>> > > >
>> > > >    - Are there any caveats in converting Avro to Parquet?
>> > > >    - other than converting Unix dates from Avro (only long is
>> > > >    available) into timestamp fields in Parquet
>> > > >
>> > > >
>> > > > There will, in all likelihood, be future installments to this entry
>> > > > as new questions arise.
>> > > >
>> > > > All help is appreciated.
>> > > >
>> > > > Regards,
>> > > >  -Stefan
>> > > >
>> > > >
>> > > >
>> > >
>> >
>>
>
>
