Re: Directory and file based partition pruning

Jinfeng Ni Thu, 10 Sep 2015 18:19:39 -0700

I opened DRILL-3765 for the multiple rule execution issue:

https://issues.apache.org/jira/browse/DRILL-3765
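
To illustrate the kind of guard that would avoid the repeated work, here is a minimal sketch in Python. It is not Drill or Calcite code: `Scan`, `PruneRule`, `run_planner` and the `pruned` marker are hypothetical names made up for this example, and the actual fix for DRILL-3765 may look quite different. The idea is only that a rule which refuses to match an already-pruned scan evaluates the filter once, however many times the planner offers it the same subtree.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Scan:
    """Hypothetical stand-in for a scan node; not a Drill class."""
    files: tuple
    pruned: bool = False  # assumed marker set once pruning has been applied


class PruneRule:
    """Hypothetical pruning rule that counts how often it evaluates the filter."""

    def __init__(self):
        self.evaluations = 0

    def matches(self, scan):
        # Guard: skip scans that an earlier firing has already pruned.
        return not scan.pruned

    def apply(self, scan, predicate):
        # The expensive step: materialize and filter the full file list.
        self.evaluations += 1
        kept = tuple(f for f in scan.files if predicate(f))
        return replace(scan, files=kept, pruned=True)


def run_planner(scan, rule, predicate, attempts):
    """Offer the same scan to the rule several times, as a planner might."""
    for _ in range(attempts):
        if rule.matches(scan):
            scan = rule.apply(scan, predicate)
    return scan
```

With this guard, offering the scan to the rule 6 times still builds the file list only once; without the `matches` check it would be built on every attempt.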



On Thu, Sep 10, 2015 at 5:34 PM, Jinfeng Ni <jinfengn...@gmail.com> wrote:
> It seems to me that one important reason we run out of heap memory in the
> partition pruning rule is that the rule itself is invoked multiple times,
> even after the filter has been pushed into the scan in the first call.
>
> I tried a simple unit test,
> TestPartitionFilter:testPartitionFilter1_Parquet_from_CTAS(). Here is the
> number of times each partition pruning rule fired in the Calcite trace:
>
>  #_rule_fire,  rule name
>
>  4 [PruneScanRule:Filter_On_Project_Parquet]
>  4 [PruneScanRule:Filter_On_Project]
>
>  2 [PruneScanRule:Filter_On_Scan_Parquet]
>  2 [PruneScanRule:Filter_On_Scan]
>
> Setting a breakpoint in PruneScanRule where it calls the interpreter to
> evaluate the expression, I could see that the code stops 6 times at that
> point, meaning that Drill has to build the vector containing the
> filenames at least 6 times.  That would cause a lot of heap memory
> consumption if GC does not kick in to release the memory used in the prior
> rule's execution.
>
> I think making partition pruning multi-phase will help reduce the
> memory consumption. But for now, it seems important to avoid the repeated
> and unnecessary rule execution.
>
>
>
>
>
> On Thu, Sep 10, 2015 at 4:42 PM, Aman Sinha <asi...@maprtech.com> wrote:
>>
>> Agree on the N-phased approach.  I have filed a JIRA for the enhancement:
>> DRILL-3759.
>> Regarding the simplification of the expression tree logic: did you mean
>> the logic in FindPartitionConditions or the Interpreter?
>> Perhaps you can add comments in the JIRA with some explanation.  I am in
>> favor of simplification where possible.
>>
>> On Wed, Sep 9, 2015 at 10:39 PM, Jacques Nadeau <jacq...@dremio.com>
>> wrote:
>>
>> > Makes sense.
>> >
>> > Is there a way we can do this with lazy materializations rather than
>> > writing complex expression tree logic? I hate having all this custom
>> > expression tree manipulation logic.
>> >
>> > Also, it seems like this should be N-phased rather than two-phase, where
>> > N is the number of directories below the base path.
>> >
>> > Thoughts?
>> > On Sep 9, 2015 10:54 AM, "Aman Sinha" <amansi...@apache.org> wrote:
>> >
>> > > Currently, partition pruning gets all file names in the table and
>> > > then applies the pruning.  If the files are spread out over several
>> > > directories and there is a filter on dirN, this is not efficient, both
>> > > in terms of elapsed time and memory usage.  This has been seen in a
>> > > few use cases recently.
>> > >
>> > > We should ideally perform the pruning in 2 steps:  first get the
>> > > top-level directory names only and apply the directory filter, then
>> > > get the filenames within that directory and apply remaining filters.
>> > >
>> > > I will create a JIRA for this enhancement but let me know your
>> > > thoughts...
>> > >
>> > > Aman
>> > >
>> >
>
>
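
The phased pruning discussed in the thread (two steps in Aman's proposal, one per directory level in Jacques's suggestion) can be sketched roughly as follows. This is a toy model in Python, not Drill code: the nested-dict file tree, the function name `prune_by_level` and the per-level predicates are all assumptions made for illustration, and it assumes every file sits below exactly one directory per filter level.

```python
def prune_by_level(tree, dir_filters):
    """Prune one directory level at a time instead of listing every file first.

    tree: nested dict; a directory maps to a dict, a file maps to None.
    dir_filters: one predicate per directory level (dir0, dir1, ...),
                 or None for a level that has no filter.
    Returns the paths of the files that survive all directory filters.
    """
    frontier = [("", tree)]
    for level_filter in dir_filters:
        next_frontier = []
        for path, node in frontier:
            for name, child in node.items():
                if child is None:
                    # Assumption: files above the leaf level are ignored here.
                    continue
                if level_filter is None or level_filter(name):
                    next_frontier.append((f"{path}/{name}", child))
        # Subtrees rejected at this level are never listed any deeper.
        frontier = next_frontier

    # Only now list the files, and only inside the surviving directories.
    return [
        f"{path}/{name}"
        for path, node in frontier
        for name, child in node.items()
        if child is None
    ]
```

For a table laid out as year/quarter directories, a filter such as dir0 = 2015 and dir1 = Q1 touches only the names under the matching year, so the file list that has to be held in memory shrinks with each level instead of being built for the whole table up front.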
