Re: Hangout Discussion Topics for 04-16-2019

Paul Rogers Wed, 24 Apr 2019 12:00:49 -0700

Hi Igor,

Thanks for the recap. You asked about vector allocation. Here is where I think 
things stand. Others can fill in details that I may miss.

We have several ways to size value vectors, but no single standard. As you 
note, the most common way is simply to accept the cost of letting the vector 
double in size multiple times.
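
To put a rough number on that cost, here is a minimal sketch in plain Java. It is not Drill code: the class and method names are made up for illustration, and it assumes each reallocation copies the entire old buffer.

```java
/** Hypothetical illustration of the cost of growth by doubling; not Drill code. */
final class DoublingCost {

    /** Number of doublings needed to grow from initialBytes to at least neededBytes. */
    static int doublings(long initialBytes, long neededBytes) {
        if (initialBytes <= 0) {
            throw new IllegalArgumentException("initialBytes must be positive");
        }
        int count = 0;
        long capacity = initialBytes;
        while (capacity < neededBytes) {
            capacity *= 2;
            count++;
        }
        return count;
    }

    /** Total bytes copied, assuming every doubling copies the whole old buffer. */
    static long bytesCopied(long initialBytes, long neededBytes) {
        if (initialBytes <= 0) {
            throw new IllegalArgumentException("initialBytes must be positive");
        }
        long copied = 0;
        long capacity = initialBytes;
        while (capacity < neededBytes) {
            copied += capacity;   // the old contents move to the new buffer
            capacity *= 2;
        }
        return copied;
    }
}
```

Under those assumptions, growing a buffer from 4 KB to 1 MB takes 8 doublings and copies close to 1 MB along the way, which is the overhead that pre-allocation tries to avoid.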

One way to pre-allocate vectors is to use the "sizer" along with its associated 
allocation helper. This was always meant to be a quick & dirty temporary 
solution, but has turned out, I believe, to be the primary vector size 
management solution in most operators.

Another is the new row set framework: vector size (in terms of number of items 
and estimated item size) is expressed in metadata, then is used to allocate 
each new batch to the desired size.

You can also just do the work yourself: pick a number, and, when allocating a 
vector, tell it to use that size. You then take on the task of estimating 
average width, picking a good target number of rows for your batch, working out 
the number of items in arrays, etc. (This is, in fact, what the other two 
methods mentioned above actually do.)
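
As a rough sketch of that do-it-yourself arithmetic (plain Java, not Drill's API; the class, method, and parameter names are hypothetical, and the batch budget and row cap are values the caller would have to choose):

```java
/** Hypothetical helper showing the manual sizing arithmetic; not Drill code. */
final class ManualVectorSizing {

    /** Pick a target row count that fits a byte budget, capped at maxRows. */
    static int targetRowCount(long batchBudgetBytes, int estimatedRowWidthBytes, int maxRows) {
        long rows = batchBudgetBytes / Math.max(1, estimatedRowWidthBytes);
        return (int) Math.max(1, Math.min(rows, maxRows));
    }

    /** Bytes to reserve for one column's data, given its estimated average width. */
    static long dataBytes(int rowCount, int avgWidthBytes) {
        return (long) rowCount * avgWidthBytes;
    }

    /** Estimated total number of items in an array (repeated) column. */
    static long arrayItemCount(int rowCount, double avgItemsPerRow) {
        return (long) Math.ceil(rowCount * avgItemsPerRow);
    }
}
```

For example, with an 8 MB budget and an estimated row width of 200 bytes, `targetRowCount(8L * 1024 * 1024, 200, 65536)` gives 41943 rows, and each vector would then be allocated for that many values at its own estimated width.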

The key problem with the ad-hoc techniques is that they can't limit maximum 
vector size to 16 MB (to avoid Netty fragmentation) nor limit overall batch 
size to some reasonable number. The ad-hoc techniques can also lead to internal 
fragmentation (excessive unused space within each vector). Solving these 
problems is what the row set framework was designed to do.
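
To illustrate the two limits, here is a minimal sketch in plain Java. It is not how the row set framework is implemented: the class and method names are hypothetical, and it assumes a known estimated width per column. It picks the largest row count that keeps every vector at or under 16 MB and the whole batch within a caller-chosen budget.

```java
/** Hypothetical illustration of enforcing both size limits; not Drill code. */
final class BatchLimits {

    /** Per-vector cap taken from the 16 MB figure mentioned above. */
    static final long MAX_VECTOR_BYTES = 16L * 1024 * 1024;

    /**
     * Largest row count such that no single column exceeds MAX_VECTOR_BYTES
     * and the sum of all columns stays within batchBudgetBytes.
     */
    static int maxRows(int[] columnWidths, long batchBudgetBytes) {
        long rowWidth = 0;
        long rows = Long.MAX_VALUE;
        for (int w : columnWidths) {
            int width = Math.max(1, w);
            rowWidth += width;
            rows = Math.min(rows, MAX_VECTOR_BYTES / width);  // per-vector limit
        }
        if (rowWidth > 0) {
            rows = Math.min(rows, batchBudgetBytes / rowWidth);  // whole-batch limit
        }
        return (int) Math.max(1, Math.min(rows, Integer.MAX_VALUE));
    }
}
```

With column widths of 4, 8, and 1024 bytes, a 32 MB batch budget is limited by the widest vector (16384 rows), while an 8 MB budget is limited by the batch total (8097 rows).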

Thanks,
- Paul

    On Wednesday, April 24, 2019, 10:48:44 AM PDT, Igor Guzenko 
<ihor.huzenko....@gmail.com> wrote:  

 Hello Everyone,

Sorry for the late reply. Here are the presentations:

Map<K,V> vector:
https://docs.google.com/presentation/d/1FG4swOrkFIRL7qjiP7PSOPy8a1vnxs5Z9PM3ZfRPRYo/edit#slide=id.p
Hive complex types:
https://docs.google.com/presentation/d/1nc0ID5aju-qj-7hjquFpH-TwGjeReWTYogsExuOe8ZA/edit?usp=sharing

Discussion results for the new Map<K,V> vector:
- Need to eliminate the possibility of key duplication;
- Need to check Hive behavior when ORDER BY is performed on a Map
complex type column;
- Need to describe the design and all use cases for the vector in the
design document.

Discussion results for Hive complex types:
- Aman Sinha made a few great suggestions. First, the Hive writers
could be created once per table scan. Second, that would be a good
point at which to calculate vector sizes and allocate early. We need
to provide a few examples describing how allocation will work for
complex types.
- Need to describe the suggested approach in the design document and
continue the discussion there.

A question from my side: do we already have predictive allocation of
value vectors implemented somewhere? Any example would be useful,
because as far as I can see, our existing vector writers usually use
the mutator's setSafe(...) methods, inside which the buffer size may
be increased when necessary.

The future design document will be located at
https://docs.google.com/document/d/1yEcaJi9dyksfMs4w5_GsZCQH_Pffe-HLeLVNNKsV7CA/edit?usp=sharing
Please feel free to leave your comments and suggestions in the
document and presentations.

Thanks,
Igor Guzenko

On Wed, Apr 17, 2019 at 3:04 AM Jyothsna Reddy <jyothsna....@gmail.com> wrote:
>
> Hi All,
> The hangout will start at 9:30 AM PST instead of 10 AM PST on 04-18-2019.
>
>
> Thank you,
> Jyothsna
>
>
>
>
> On Tue, Apr 16, 2019 at 2:00 PM Jyothsna Reddy <jyothsna....@gmail.com>
> wrote:
>
> > Hi Charles,
> > Yes, sure!! We can probably start with your discussion first and Hive
> > complex types later, since there will be some discussion around the latter
> > topic.
> >
> > Thank you,
> > Jyothsna
> >
> >
> >
> >
> > On Tue, Apr 16, 2019 at 1:40 PM Charles Givre <cgi...@gmail.com> wrote:
> >
> >> Hi Jyothsna,
> >> Could I get a few minutes on the next Hangout to promote the Drill day at
> >> ApacheCon?
> >> Thanks
> >>
> >> > On Apr 16, 2019, at 16:38, Jyothsna Reddy <jyothsna....@gmail.com>
> >> wrote:
> >> >
> >> > Hi Everyone,
> >> >
> >> > Here are some key points of today's hangout discussion:
> >> >
> >> > Sorabh mentioned that there are some regressions in TPCDS queries and
> >> > it's a blocker for the 1.16 release.
> >> >
> >> > Bohdan presented their proposal for Hive Complex types support. Here are
> >> > some of the important points:
> >> >
> >> >  - Structure of MapVector: Keys are of primitive type, whereas values can
> >> >  be of either primitive or complex type.
> >> >  - MapReader and MapWriter are used to read from and write to the
> >> >  MapVector.
> >> >  - MapWriter tracks the current row/length and is used to calculate the
> >> >  write position and offset.
> >> >
> >> > Following are some of the questions from the audience:
> >> >
> >> >  - Will the types be implicitly cast, since Calcite supports keys of
> >> >  type int and string?
> >> >  - Future improvements include sorting the keys for better lookup. Is
> >> >  it per row or across all the rows?
> >> >
> >> > Since there is more to discuss, there will be a hangout session on
> >> > 04-18-2019 at 10 AM PST (link
> >> > http://meet.google.com/yki-iqdf-tai).
> >> >
> >> > Thank you,
> >> > Jyothsna
> >> >
> >> >
> >> >
> >> > On Mon, Apr 15, 2019 at 11:48 AM Bohdan Kazydub <
> >> bohdan.kazy...@gmail.com>
> >> > wrote:
> >> >
> >> >> Hello,
> >> >> Igor and I would like to discuss Hive Complex types support.
> >> >>
> >> >> Thanks,
> >> >> Bohdan
> >> >>
> >> >> On Mon, Apr 15, 2019 at 8:47 PM Charles Givre <cgi...@gmail.com>
> >> wrote:
> >> >>
> >> >>> I’d like to promote the Drill track for ApacheCon.
> >> >>>
> >> >>> Sent from my iPhone
> >> >>>
> >> >>>> On Apr 15, 2019, at 13:09, Jyothsna Reddy <jyothsna....@gmail.com>
> >> >>> wrote:
> >> >>>>
> >> >>>> Hello Everyone,
> >> >>>> Does anyone have any topics for tomorrow's hangout?
> >> >>>>
> >> >>>> We will start the hangout at 10 AM PST (link
> >> >>>> http://meet.google.com/yki-iqdf-tai).
> >> >>>>
> >> >>>> Thank you,
> >> >>>> Jyothsna
> >> >>>
> >> >>
> >>
> >>
