Re: [DISCUSS] Remove required type

Parth Chandra Tue, 22 Mar 2016 15:08:49 -0700

I'm not entirely convinced that this would have no performance impact. Do
we have any experiments?



On Tue, Mar 22, 2016 at 1:36 PM, Jacques Nadeau <[email protected]> wrote:

> My suggestion is we use explicit observation at the batch level. If there
> are no nulls we can optimize this batch. This would ultimately improve over
> our current situation where most parquet and all json data is nullable so
> we don't optimize. I'd estimate that the vast majority of Drill's workloads
> are marked nullable whether they are or not. So what we're really
> suggesting is deleting a bunch of code which is rarely in the execution
> path.
> On Mar 22, 2016 1:22 PM, "Aman Sinha" <[email protected]> wrote:
>
> > I was thinking about it more after sending the previous concerns.  Agree,
> > this is an execution side change...but some details need to be worked out.
> > If the planner indicates to the executor that a column is non-nullable (e.g.
> > a primary key), the run-time generated code is more efficient since it
> > does not have to check the null bit.  Are you thinking we would use the
> > existing nullable vector and add some additional metadata (at a record
> > batch level rather than record level) to indicate non-nullability?
> >
> >
> > On Tue, Mar 22, 2016 at 12:27 PM, Jacques Nadeau <[email protected]>
> > wrote:
> >
> > > Hey Aman, I believe both Steven and I were suggesting removal only
> > > from execution, not planning. It seems like your concerns are all related
> > > to planning. It seems like the real tradeoffs in execution are nominal.
> > > On Mar 22, 2016 9:03 AM, "Aman Sinha" <[email protected]> wrote:
> > >
> > > > While it is true that there is code complexity due to the required type,
> > > > what would we be trading off?  Some important considerations:
> > > >   - We don't currently have null count statistics, which would need to be
> > > >     implemented for various data sources
> > > >   - Primary keys in the RDBMS sources (or rowkeys in HBase) are always
> > > >     non-null, and although today we may not be doing optimizations to
> > > >     leverage that, one could easily add a rule that converts
> > > >     WHERE primary_key IS NULL to a FALSE filter.
> > > >
> > > >
> > > > On Tue, Mar 22, 2016 at 7:31 AM, Dave Oshinsky <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi Jacques,
> > > > > Marginally related to this, I made a small change in PR-372 (DRILL-4184)
> > > > > to support variable widths for decimal quantities in Parquet.  I found the
> > > > > (decimal) vectoring code to be very difficult to understand (probably
> > > > > because it's overly complex, but also because I'm new to Drill code in
> > > > > general), so I made a small, surgical change in my pull request to support
> > > > > keeping track of variable widths (lengths) and null booleans within the
> > > > > existing fixed width decimal vectoring scheme.  Can my changes be
> > > > > reviewed/accepted, and then we discuss how to fix properly long-term?
> > > > >
> > > > > Thanks,
> > > > > Dave Oshinsky
> > > > >
> > > > > -----Original Message-----
> > > > > From: Jacques Nadeau [mailto:[email protected]]
> > > > > Sent: Monday, March 21, 2016 11:43 PM
> > > > > To: dev
> > > > > Subject: Re: [DISCUSS] Remove required type
> > > > >
> > > > > Definitely in support of this. The required type is a huge maintenance and
> > > > > code complexity nightmare that provides little to no benefit. As you point
> > > > > out, we can do better performance optimizations through null count
> > > > > observation since most sources are nullable anyway.
> > > > > On Mar 21, 2016 7:41 PM, "Steven Phillips" <[email protected]> wrote:
> > > > >
> > > > > > I have been thinking about this for a while now, and I feel it would
> > > > > > be a good idea to remove the Required vector types from Drill, and
> > > > > > only use the Nullable version of vectors. I think this will greatly
> > > > > > simplify the code.
> > > > > > It will also simplify the creation of UDFs. As is, if a function has
> > > > > > custom null handling (i.e. INTERNAL), the function has to be
> > > > > > separately implemented for each permutation of nullability of the
> > > > > > inputs. But if Drill data types are always nullable, this wouldn't be a
> > > > > > problem.
> > > > > >
> > > > > > I don't think there would be much impact on performance. In practice,
> > > > > > I think the required type is used very rarely. And there are other
> > > > > > ways we can optimize for when a column is known to have no nulls.
> > > > > >
> > > > > > Thoughts?
> > > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
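
As a rough illustration of the batch-level null observation discussed in
this thread, the sketch below uses a hypothetical NullableIntBatch class. It
is not Drill's actual vector classes or generated code. It assumes a null
count is computed once per batch, and a simple sum stands in for whatever the
operator would really compute. When the batch has no nulls, the loop skips the
per-row null-bit check.

```java
import java.util.BitSet;

/** Hypothetical stand-in for a nullable int vector; names are illustrative only. */
final class NullableIntBatch {
    final int[] values;
    final BitSet isSet;   // bit i set => values[i] is non-null
    final int nullCount;  // batch-level observation, computed once per batch

    NullableIntBatch(int[] values, BitSet isSet) {
        this.values = values;
        this.isSet = isSet;
        // Count only the bits that fall inside this batch.
        this.nullCount = values.length - isSet.get(0, values.length).cardinality();
    }

    /** Sums the non-null values, choosing the loop based on the batch-level null count. */
    long sum() {
        long total = 0;
        if (nullCount == 0) {
            // Fast path: no nulls observed in this batch, so no per-row check.
            for (int v : values) {
                total += v;
            }
        } else {
            // Slow path: consult the null bit for every row.
            for (int i = 0; i < values.length; i++) {
                if (isSet.get(i)) {
                    total += values[i];
                }
            }
        }
        return total;
    }
}
```

Whether the fast path pays off in practice is the open question raised at the
top of this message; the sketch only shows where the branch would sit.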
