Re: [DISSCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers
Per my comments there, the introduction of field buffers was added as part of the fieldvector addition when we have vectors that weren't field level. This meant that getbuffers and getfieldbuffers were at different levels at hierarchy (getbuffers being more general). I believe we no longer have the case of a vector that is not a field. As such, collapsing fieldvector and valuevector makes a lot of sense. In the thread I tried to say that we should capture the use cases and then consolidate to support those use cases. I think there are definitely two use cases (with sub in one of them) that I'm aware of: - get me buffers for writing to a io (socket or disk) - transfer the buffers - keep the buffers in vector (double reference counts) - get me the buffers so I can do some kind of low level memory operations (cheap as possible please) I actually think we've also failed to complete the other work that we talked about wrt removing readerIndex and writerIndex from ArrowBuf. They aren't well maintained by any of the vectors and thus are mostly broken (and create a bunch of this confusion). So I'd actually be inclined to fix the readerIndex/writerIndex issue before we mess with the buffer methods. Removing that whole concept will collapse the use cases, I believe. Then we only have transfer versus not (which is what the getBuffers() method already provides). On Mon, Aug 10, 2020 at 1:44 AM Ji Liu wrote: > Hi Micah, I am afraid it's not a reasonable solution. > 1. The status is that getFieldBuffers has right order buffer and was used > in IPC, getBuffers was not used in IPC. > 2. The purpose of this PR is to use getBuffers in IPC instead, and making > changes in getFieldBuffers dose not seem to help this problem since it will > break IPC format by using getBuffers. > > Micah Kornfield 于2020年8月8日周六 上午11:50写道: > > > Thinking about this some more, I think maybe we should also potentially > > try to deprecate hold off on any changes to getFieldBuffers. It should > > likely follow the same sort of pattern for getBuffers (create a new > method > > that doesn't set reader/writer indices and deprecate getFieldBuffers). > But > > I think that can be handled in a separate PR? > > > > Anybody else have thoughts? > > > > -Micah > > > > On Tue, Aug 4, 2020 at 11:24 PM Ji Liu wrote: > > > > > hi liya, > > > Thanks for your careful review, it is a typo, the order of getBuffers > is > > > wrong. > > > > > > Fan Liya 于2020年8月5日周三 下午2:14写道: > > > > > > > Hi Ji, > > > > > > > > IMO, for the correct order, the validity buffer should precede the > > offset > > > > buffer (e.g. this is the order used by BaseVariableWidthVector & > > > > BaseLargeVariableWidthVector). > > > > In ListVector#getBuffers, the offset buffer precedes the validity > > buffer, > > > > so I am a little confused why you say the order of > > ListVector#getBuffers > > > is > > > > right? > > > > > > > > Best, > > > > Liya Fan > > > > > > > > On Wed, Aug 5, 2020 at 12:32 PM Micah Kornfield < > emkornfi...@gmail.com > > > > > > > wrote: > > > > > > > > > FWIW, I lack historical context on how these methods evolved, so > I'd > > > > > appreciate insight from anyone who has worked on the java codebase > > for > > > a > > > > > longer period of time. The current situation seems less then > ideal. > > > > > > > > > > On Tue, Aug 4, 2020 at 12:55 AM Ji Liu > wrote: > > > > > > > > > > > Hi all, > > > > > > > > > > > > > > > > > > When I worked on ARROW-7539[1], I met some problems and not sure > > > what's > > > > > the > > > > > > proper way to solve it. > > > > > > > > > > > > > > > > > > This issue was about to avoid set reader/writer indices in > > > > > > FieldVector#getFieldBuffers according to the following reasons: > > > > > > > > > > > > i. getBuffers set reader/writer indices and it's right for the > > > purpose > > > > of > > > > > > sending the data over the wire > > > > > > > > > > > > ii. getFieldBuffers is distinct from getBuffers, it should be for > > > > getting > > > > > > access to underlying data for higher-performance algorithms > > > > > > > > > > > > > > > > > > Currently in VectorUnloader, we used getFieldBuffers to create > > > > > > ArrowRecordBatch that's why we keep writer/reader indices in > > > > > > getFieldBuffers > > > > > > , we should use getBuffers instead. But during the change, we > found > > > > > another > > > > > > problem: > > > > > > > > > > > > The order of validity and offset buffers are not in the same > order > > in > > > > > > ListVector(getBuffers's order is right), changing the API in > > > > > VectorUnloader > > > > > > creates problems with serialization/deserialization resulting in > > test > > > > > > failures in Dremio which would break backward compatibility with > > > > existing > > > > > > serialised files. > > > > > > > > > > > > > > > > > > Micah gives a solution but seems doesn't reach consistent in the > PR > > > > > > thread[2 > > > > > > ]: > > > > > > > > > > > >1. Remove setReaderWrite
Re: [DISSCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers
Hi Micah, I am afraid it's not a reasonable solution. 1. The status is that getFieldBuffers has right order buffer and was used in IPC, getBuffers was not used in IPC. 2. The purpose of this PR is to use getBuffers in IPC instead, and making changes in getFieldBuffers dose not seem to help this problem since it will break IPC format by using getBuffers. Micah Kornfield 于2020年8月8日周六 上午11:50写道: > Thinking about this some more, I think maybe we should also potentially > try to deprecate hold off on any changes to getFieldBuffers. It should > likely follow the same sort of pattern for getBuffers (create a new method > that doesn't set reader/writer indices and deprecate getFieldBuffers). But > I think that can be handled in a separate PR? > > Anybody else have thoughts? > > -Micah > > On Tue, Aug 4, 2020 at 11:24 PM Ji Liu wrote: > > > hi liya, > > Thanks for your careful review, it is a typo, the order of getBuffers is > > wrong. > > > > Fan Liya 于2020年8月5日周三 下午2:14写道: > > > > > Hi Ji, > > > > > > IMO, for the correct order, the validity buffer should precede the > offset > > > buffer (e.g. this is the order used by BaseVariableWidthVector & > > > BaseLargeVariableWidthVector). > > > In ListVector#getBuffers, the offset buffer precedes the validity > buffer, > > > so I am a little confused why you say the order of > ListVector#getBuffers > > is > > > right? > > > > > > Best, > > > Liya Fan > > > > > > On Wed, Aug 5, 2020 at 12:32 PM Micah Kornfield > > > > wrote: > > > > > > > FWIW, I lack historical context on how these methods evolved, so I'd > > > > appreciate insight from anyone who has worked on the java codebase > for > > a > > > > longer period of time. The current situation seems less then ideal. > > > > > > > > On Tue, Aug 4, 2020 at 12:55 AM Ji Liu wrote: > > > > > > > > > Hi all, > > > > > > > > > > > > > > > When I worked on ARROW-7539[1], I met some problems and not sure > > what's > > > > the > > > > > proper way to solve it. > > > > > > > > > > > > > > > This issue was about to avoid set reader/writer indices in > > > > > FieldVector#getFieldBuffers according to the following reasons: > > > > > > > > > > i. getBuffers set reader/writer indices and it's right for the > > purpose > > > of > > > > > sending the data over the wire > > > > > > > > > > ii. getFieldBuffers is distinct from getBuffers, it should be for > > > getting > > > > > access to underlying data for higher-performance algorithms > > > > > > > > > > > > > > > Currently in VectorUnloader, we used getFieldBuffers to create > > > > > ArrowRecordBatch that's why we keep writer/reader indices in > > > > > getFieldBuffers > > > > > , we should use getBuffers instead. But during the change, we found > > > > another > > > > > problem: > > > > > > > > > > The order of validity and offset buffers are not in the same order > in > > > > > ListVector(getBuffers's order is right), changing the API in > > > > VectorUnloader > > > > > creates problems with serialization/deserialization resulting in > test > > > > > failures in Dremio which would break backward compatibility with > > > existing > > > > > serialised files. > > > > > > > > > > > > > > > Micah gives a solution but seems doesn't reach consistent in the PR > > > > > thread[2 > > > > > ]: > > > > > > > > > >1. Remove setReaderWriterIndeces in getFieldBuffers > > > > >2. Deprecate getBuffers > > > > >3. Introduce a new getIpcBuffers which is unambiguously used for > > > > writing > > > > >record batches (i.e. in VectorUnloader). > > > > >4. Update documentation where it makes sense based on all this > > > > >conversation. > > > > > > > > > > > > > > > More details and discussions can be seen from the PR and hope to > get > > > more > > > > > feedback. > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > Ji Liu > > > > > > > > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/ARROW-7539 > > > > > > > > > > [2] https://github.com/apache/arrow/pull/6156 > > > > > > > > > > > > > > >
Re: [DISSCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers
Thinking about this some more, I think maybe we should also potentially try to deprecate hold off on any changes to getFieldBuffers. It should likely follow the same sort of pattern for getBuffers (create a new method that doesn't set reader/writer indices and deprecate getFieldBuffers). But I think that can be handled in a separate PR? Anybody else have thoughts? -Micah On Tue, Aug 4, 2020 at 11:24 PM Ji Liu wrote: > hi liya, > Thanks for your careful review, it is a typo, the order of getBuffers is > wrong. > > Fan Liya 于2020年8月5日周三 下午2:14写道: > > > Hi Ji, > > > > IMO, for the correct order, the validity buffer should precede the offset > > buffer (e.g. this is the order used by BaseVariableWidthVector & > > BaseLargeVariableWidthVector). > > In ListVector#getBuffers, the offset buffer precedes the validity buffer, > > so I am a little confused why you say the order of ListVector#getBuffers > is > > right? > > > > Best, > > Liya Fan > > > > On Wed, Aug 5, 2020 at 12:32 PM Micah Kornfield > > wrote: > > > > > FWIW, I lack historical context on how these methods evolved, so I'd > > > appreciate insight from anyone who has worked on the java codebase for > a > > > longer period of time. The current situation seems less then ideal. > > > > > > On Tue, Aug 4, 2020 at 12:55 AM Ji Liu wrote: > > > > > > > Hi all, > > > > > > > > > > > > When I worked on ARROW-7539[1], I met some problems and not sure > what's > > > the > > > > proper way to solve it. > > > > > > > > > > > > This issue was about to avoid set reader/writer indices in > > > > FieldVector#getFieldBuffers according to the following reasons: > > > > > > > > i. getBuffers set reader/writer indices and it's right for the > purpose > > of > > > > sending the data over the wire > > > > > > > > ii. getFieldBuffers is distinct from getBuffers, it should be for > > getting > > > > access to underlying data for higher-performance algorithms > > > > > > > > > > > > Currently in VectorUnloader, we used getFieldBuffers to create > > > > ArrowRecordBatch that's why we keep writer/reader indices in > > > > getFieldBuffers > > > > , we should use getBuffers instead. But during the change, we found > > > another > > > > problem: > > > > > > > > The order of validity and offset buffers are not in the same order in > > > > ListVector(getBuffers's order is right), changing the API in > > > VectorUnloader > > > > creates problems with serialization/deserialization resulting in test > > > > failures in Dremio which would break backward compatibility with > > existing > > > > serialised files. > > > > > > > > > > > > Micah gives a solution but seems doesn't reach consistent in the PR > > > > thread[2 > > > > ]: > > > > > > > >1. Remove setReaderWriterIndeces in getFieldBuffers > > > >2. Deprecate getBuffers > > > >3. Introduce a new getIpcBuffers which is unambiguously used for > > > writing > > > >record batches (i.e. in VectorUnloader). > > > >4. Update documentation where it makes sense based on all this > > > >conversation. > > > > > > > > > > > > More details and discussions can be seen from the PR and hope to get > > more > > > > feedback. > > > > > > > > > > > > > > > > Thanks, > > > > > > > > Ji Liu > > > > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/ARROW-7539 > > > > > > > > [2] https://github.com/apache/arrow/pull/6156 > > > > > > > > > >
Re: [DISSCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers
hi liya, Thanks for your careful review, it is a typo, the order of getBuffers is wrong. Fan Liya 于2020年8月5日周三 下午2:14写道: > Hi Ji, > > IMO, for the correct order, the validity buffer should precede the offset > buffer (e.g. this is the order used by BaseVariableWidthVector & > BaseLargeVariableWidthVector). > In ListVector#getBuffers, the offset buffer precedes the validity buffer, > so I am a little confused why you say the order of ListVector#getBuffers is > right? > > Best, > Liya Fan > > On Wed, Aug 5, 2020 at 12:32 PM Micah Kornfield > wrote: > > > FWIW, I lack historical context on how these methods evolved, so I'd > > appreciate insight from anyone who has worked on the java codebase for a > > longer period of time. The current situation seems less then ideal. > > > > On Tue, Aug 4, 2020 at 12:55 AM Ji Liu wrote: > > > > > Hi all, > > > > > > > > > When I worked on ARROW-7539[1], I met some problems and not sure what's > > the > > > proper way to solve it. > > > > > > > > > This issue was about to avoid set reader/writer indices in > > > FieldVector#getFieldBuffers according to the following reasons: > > > > > > i. getBuffers set reader/writer indices and it's right for the purpose > of > > > sending the data over the wire > > > > > > ii. getFieldBuffers is distinct from getBuffers, it should be for > getting > > > access to underlying data for higher-performance algorithms > > > > > > > > > Currently in VectorUnloader, we used getFieldBuffers to create > > > ArrowRecordBatch that's why we keep writer/reader indices in > > > getFieldBuffers > > > , we should use getBuffers instead. But during the change, we found > > another > > > problem: > > > > > > The order of validity and offset buffers are not in the same order in > > > ListVector(getBuffers's order is right), changing the API in > > VectorUnloader > > > creates problems with serialization/deserialization resulting in test > > > failures in Dremio which would break backward compatibility with > existing > > > serialised files. > > > > > > > > > Micah gives a solution but seems doesn't reach consistent in the PR > > > thread[2 > > > ]: > > > > > >1. Remove setReaderWriterIndeces in getFieldBuffers > > >2. Deprecate getBuffers > > >3. Introduce a new getIpcBuffers which is unambiguously used for > > writing > > >record batches (i.e. in VectorUnloader). > > >4. Update documentation where it makes sense based on all this > > >conversation. > > > > > > > > > More details and discussions can be seen from the PR and hope to get > more > > > feedback. > > > > > > > > > > > > Thanks, > > > > > > Ji Liu > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/ARROW-7539 > > > > > > [2] https://github.com/apache/arrow/pull/6156 > > > > > >
Re: [DISSCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers
Hi Ji, IMO, for the correct order, the validity buffer should precede the offset buffer (e.g. this is the order used by BaseVariableWidthVector & BaseLargeVariableWidthVector). In ListVector#getBuffers, the offset buffer precedes the validity buffer, so I am a little confused why you say the order of ListVector#getBuffers is right? Best, Liya Fan On Wed, Aug 5, 2020 at 12:32 PM Micah Kornfield wrote: > FWIW, I lack historical context on how these methods evolved, so I'd > appreciate insight from anyone who has worked on the java codebase for a > longer period of time. The current situation seems less then ideal. > > On Tue, Aug 4, 2020 at 12:55 AM Ji Liu wrote: > > > Hi all, > > > > > > When I worked on ARROW-7539[1], I met some problems and not sure what's > the > > proper way to solve it. > > > > > > This issue was about to avoid set reader/writer indices in > > FieldVector#getFieldBuffers according to the following reasons: > > > > i. getBuffers set reader/writer indices and it's right for the purpose of > > sending the data over the wire > > > > ii. getFieldBuffers is distinct from getBuffers, it should be for getting > > access to underlying data for higher-performance algorithms > > > > > > Currently in VectorUnloader, we used getFieldBuffers to create > > ArrowRecordBatch that's why we keep writer/reader indices in > > getFieldBuffers > > , we should use getBuffers instead. But during the change, we found > another > > problem: > > > > The order of validity and offset buffers are not in the same order in > > ListVector(getBuffers's order is right), changing the API in > VectorUnloader > > creates problems with serialization/deserialization resulting in test > > failures in Dremio which would break backward compatibility with existing > > serialised files. > > > > > > Micah gives a solution but seems doesn't reach consistent in the PR > > thread[2 > > ]: > > > >1. Remove setReaderWriterIndeces in getFieldBuffers > >2. Deprecate getBuffers > >3. Introduce a new getIpcBuffers which is unambiguously used for > writing > >record batches (i.e. in VectorUnloader). > >4. Update documentation where it makes sense based on all this > >conversation. > > > > > > More details and discussions can be seen from the PR and hope to get more > > feedback. > > > > > > > > Thanks, > > > > Ji Liu > > > > > > > > [1] https://issues.apache.org/jira/browse/ARROW-7539 > > > > [2] https://github.com/apache/arrow/pull/6156 > > >
Re: [DISSCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers
FWIW, I lack historical context on how these methods evolved, so I'd appreciate insight from anyone who has worked on the java codebase for a longer period of time. The current situation seems less then ideal. On Tue, Aug 4, 2020 at 12:55 AM Ji Liu wrote: > Hi all, > > > When I worked on ARROW-7539[1], I met some problems and not sure what's the > proper way to solve it. > > > This issue was about to avoid set reader/writer indices in > FieldVector#getFieldBuffers according to the following reasons: > > i. getBuffers set reader/writer indices and it's right for the purpose of > sending the data over the wire > > ii. getFieldBuffers is distinct from getBuffers, it should be for getting > access to underlying data for higher-performance algorithms > > > Currently in VectorUnloader, we used getFieldBuffers to create > ArrowRecordBatch that's why we keep writer/reader indices in > getFieldBuffers > , we should use getBuffers instead. But during the change, we found another > problem: > > The order of validity and offset buffers are not in the same order in > ListVector(getBuffers's order is right), changing the API in VectorUnloader > creates problems with serialization/deserialization resulting in test > failures in Dremio which would break backward compatibility with existing > serialised files. > > > Micah gives a solution but seems doesn't reach consistent in the PR > thread[2 > ]: > >1. Remove setReaderWriterIndeces in getFieldBuffers >2. Deprecate getBuffers >3. Introduce a new getIpcBuffers which is unambiguously used for writing >record batches (i.e. in VectorUnloader). >4. Update documentation where it makes sense based on all this >conversation. > > > More details and discussions can be seen from the PR and hope to get more > feedback. > > > > Thanks, > > Ji Liu > > > > [1] https://issues.apache.org/jira/browse/ARROW-7539 > > [2] https://github.com/apache/arrow/pull/6156 >
[DISSCUSS][JAVA] Avoid set reader/writer indices in FieldVector#getFieldBuffers
Hi all, When I worked on ARROW-7539[1], I met some problems and not sure what's the proper way to solve it. This issue was about to avoid set reader/writer indices in FieldVector#getFieldBuffers according to the following reasons: i. getBuffers set reader/writer indices and it's right for the purpose of sending the data over the wire ii. getFieldBuffers is distinct from getBuffers, it should be for getting access to underlying data for higher-performance algorithms Currently in VectorUnloader, we used getFieldBuffers to create ArrowRecordBatch that's why we keep writer/reader indices in getFieldBuffers , we should use getBuffers instead. But during the change, we found another problem: The order of validity and offset buffers are not in the same order in ListVector(getBuffers's order is right), changing the API in VectorUnloader creates problems with serialization/deserialization resulting in test failures in Dremio which would break backward compatibility with existing serialised files. Micah gives a solution but seems doesn't reach consistent in the PR thread[2 ]: 1. Remove setReaderWriterIndeces in getFieldBuffers 2. Deprecate getBuffers 3. Introduce a new getIpcBuffers which is unambiguously used for writing record batches (i.e. in VectorUnloader). 4. Update documentation where it makes sense based on all this conversation. More details and discussions can be seen from the PR and hope to get more feedback. Thanks, Ji Liu [1] https://issues.apache.org/jira/browse/ARROW-7539 [2] https://github.com/apache/arrow/pull/6156