I'm not entirely convinced that this would have no performance impact. Do we have any experiments?
On Tue, Mar 22, 2016 at 1:36 PM, Jacques Nadeau <[email protected]> wrote:

> My suggestion is we use explicit observation at the batch level. If there
> are no nulls we can optimize this batch. This would ultimately improve over
> our current situation where most Parquet and all JSON data is nullable so
> we don't optimize. I'd estimate that the vast majority of Drill's workloads
> are marked nullable whether they are or not. So what we're really
> suggesting is deleting a bunch of code which is rarely in the execution
> path.
>
> On Mar 22, 2016 1:22 PM, "Aman Sinha" <[email protected]> wrote:
>
> > I was thinking about it more after sending the previous concerns. Agree,
> > this is an execution side change...but some details need to be worked out.
> > If the planner indicates to the executor that a column is non-nullable
> > (e.g. a primary key), the run-time generated code is more efficient since
> > it does not have to check the null bit. Are you thinking we would use the
> > existing nullable vector and add some additional metadata (at a record
> > batch level rather than record level) to indicate non-nullability?
> >
> > On Tue, Mar 22, 2016 at 12:27 PM, Jacques Nadeau <[email protected]> wrote:
> >
> > > Hey Aman, I believe both Steven and I were suggesting removal only
> > > from execution, not planning. It seems like your concerns are all
> > > related to planning. It seems like the real tradeoffs in execution are
> > > nominal.
> > >
> > > On Mar 22, 2016 9:03 AM, "Aman Sinha" <[email protected]> wrote:
> > >
> > > > While it is true that there is code complexity due to the required
> > > > type, what would we be trading off? Some important considerations:
> > > > - We don't currently have null count statistics, which would need to
> > > >   be implemented for various data sources
> > > > - Primary keys in the RDBMS sources (or rowkeys in HBase) are always
> > > >   non-null, and although today we may not be doing optimizations to
> > > >   leverage that, one could easily add a rule that converts
> > > >   WHERE primary_key IS NULL to a FALSE filter.
> > > >
> > > > On Tue, Mar 22, 2016 at 7:31 AM, Dave Oshinsky <[email protected]> wrote:
> > > >
> > > > > Hi Jacques,
> > > > > Marginally related to this, I made a small change in PR-372
> > > > > (DRILL-4184) to support variable widths for decimal quantities in
> > > > > Parquet. I found the (decimal) vectoring code to be very difficult
> > > > > to understand (probably because it's overly complex, but also
> > > > > because I'm new to Drill code in general), so I made a small,
> > > > > surgical change in my pull request to support keeping track of
> > > > > variable widths (lengths) and null booleans within the existing
> > > > > fixed width decimal vectoring scheme. Can my changes be
> > > > > reviewed/accepted, and then we discuss how to fix properly
> > > > > long-term?
> > > > >
> > > > > Thanks,
> > > > > Dave Oshinsky
> > > > >
> > > > > -----Original Message-----
> > > > > From: Jacques Nadeau [mailto:[email protected]]
> > > > > Sent: Monday, March 21, 2016 11:43 PM
> > > > > To: dev
> > > > > Subject: Re: [DISCUSS] Remove required type
> > > > >
> > > > > Definitely in support of this. The required type is a huge
> > > > > maintenance and code complexity nightmare that provides little to
> > > > > no benefit. As you point out, we can do better performance
> > > > > optimizations through null count observation since most sources
> > > > > are nullable anyway.
> > > > >
> > > > > On Mar 21, 2016 7:41 PM, "Steven Phillips" <[email protected]> wrote:
> > > > >
> > > > > > I have been thinking about this for a while now, and I feel it
> > > > > > would be a good idea to remove the Required vector types from
> > > > > > Drill, and only use the Nullable version of vectors. I think this
> > > > > > will greatly simplify the code.
> > > > > > It will also simplify the creation of UDFs. As is, if a function
> > > > > > has custom null handling (i.e. INTERNAL), the function has to be
> > > > > > separately implemented for each permutation of nullability of the
> > > > > > inputs. But if drill data types are always nullable, this wouldn't
> > > > > > be a problem.
> > > > > >
> > > > > > I don't think there would be much impact on performance. In
> > > > > > practice, I think the required type is used very rarely. And there
> > > > > > are other ways we can optimize for when a column is known to have
> > > > > > no nulls.
> > > > > >
> > > > > > Thoughts?
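
For concreteness, the batch-level null observation suggested in the thread could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not Drill's actual vector code: `NullableIntBatch`, `hasNoNulls`, `sum`, and `of` are hypothetical names, and a `java.util.BitSet` stands in for the per-row null bits. The null count is observed once per batch, and the per-row check is skipped when the batch turns out to contain no nulls.

```java
import java.util.BitSet;

/** Hypothetical, simplified nullable int batch; not Drill's real vector classes. */
final class NullableIntBatch {
    private final int[] values;
    private final BitSet isSet;   // bit i set means row i is non-null
    private final int nullCount;  // observed once, at the batch level

    NullableIntBatch(int[] values, BitSet isSet) {
        this.values = values;
        this.isSet = isSet;
        // Count only the bits that fall inside this batch's row range.
        this.nullCount = values.length - isSet.get(0, values.length).cardinality();
    }

    /** Convenience factory: a null element marks a null row. */
    static NullableIntBatch of(Integer... rows) {
        int[] values = new int[rows.length];
        BitSet isSet = new BitSet(rows.length);
        for (int i = 0; i < rows.length; i++) {
            if (rows[i] != null) {
                values[i] = rows[i];
                isSet.set(i);
            }
        }
        return new NullableIntBatch(values, isSet);
    }

    boolean hasNoNulls() {
        return nullCount == 0;
    }

    /** Sums the non-null rows, choosing a code path per batch. */
    long sum() {
        long total = 0;
        if (hasNoNulls()) {
            // Fast path: no per-row null check, as with a required vector.
            for (int v : values) {
                total += v;
            }
        } else {
            // Slow path: visit only the rows whose null bit is set.
            for (int i = isSet.nextSetBit(0); i >= 0 && i < values.length;
                 i = isSet.nextSetBit(i + 1)) {
                total += values[i];
            }
        }
        return total;
    }
}
```

With this shape, `NullableIntBatch.of(1, 2, 3)` takes the fast path and `NullableIntBatch.of(1, null, 3)` takes the checked path. Whether the extra branch per batch is really free is the kind of question the experiments asked about at the top of the thread would have to answer.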
