Re: Update on ARROW-1463, related subtasks and plan for testing and merging

2017-10-13 Thread Siddharth Teotia
Okay, sounds good.

On Fri, Oct 13, 2017 at 2:50 PM, Wes McKinney  wrote:

> It is fine to have not-completely-working states in the refactor
> branch. I recommend do whatever is the most expedient thing to help
> with making progress.
>
> - Wes
>
> On Fri, Oct 13, 2017 at 5:42 PM, Siddharth Teotia 
> wrote:
> > Li,
> >
> > I think there is some confusion. Are you suggesting merging into "java
> > vector refactor" branch or the master? Is it fine to merge stuff on the
> > former branch even though few things are broken (around 10 tests) ? If
> this
> > is allowed, I can do some cleanup (some documentation, some TODOs
> suggested
> > by you and Brian) and we can merge the current patch by EOD or over the
> > weekend.
> >
> > Is this okay? Since we are going to iterate over this branch and not
> going
> > to push anything to master until new code is stable, we are probably
> good.
> >
> > Thanks,
> > Siddhath
> >
> >
> >
> > On Fri, Oct 13, 2017 at 12:17 PM, Li Jin  wrote:
> >
> >> Siddharth,
> >>
> >> Regarding rename:
> >> Yes this can be done later.
> >>
> >> Tests:
> >> I agree having code like https://github.com/apache/
> >> arrow/pull/1164/files#diff-0876c9a0005d1dbaea321ea8d39d79ae is hard to
> >> maintain even temporarily. I am not sure what's the best way to resolve
> >> test failure wrt removing of the accessor/mutator from the vectors.
> Maybe
> >> we can have change the template the create non-accessor/mutator
> >> getter/setters and also remove acessor/mutator in the test for it to
> pass?
> >> What do you think is the easiest?
> >>
> >> Reader/Writer:
> >> Yes we can address this later.
> >>
> >> Apologies if I seem to add more work for merging https://github.com/
> >> apache/arrow/pull/1164, that's not my intention, I think the PR looks
> >> good -
> >> just want to bring up some major design decisions so people can comment
> and
> >> discuss.
> >>
> >> Li
> >>
> >>
> >>
> >>
> >> On Fri, Oct 13, 2017 at 2:37 PM, Siddharth Teotia  >
> >> wrote:
> >>
> >> > I am not quite sure of the need to rename the vectors. Why do we need
> to
> >> > rename? This would first require us to remove all the vectors
> generated
> >> by
> >> > FixedValueVectors.java as they are non-nullable scalar vectors.
> Removing
> >> > non-nullable vectors is one of the goals, but it can be done once the
> new
> >> > infrastructure is properly setup?
> >> >
> >> > In order to merge the existing patch, I first need to address some
> >> (10-15)
> >> > failures -- few of them are correctness issues w.r.t
> >> TestVectorUnloadLoad,
> >> > TestArrowFile and rest all are related to getMutator(), getAccessor()
> >> > throwing UnsupportedOperationException. This is why I was saying
> earlier
> >> > that I will end up doing a lot of rework by writing redundant code
> where
> >> > (if vector instanceof NullableInt or vector instanceof
> NullableVarChar)
> >> we
> >> > don't use the mutator/accessor and for other vectors we use it for the
> >> > current patch. These if conditions are getting complicated with ugly
> type
> >> > casting in some parts of the code --
> >> > https://github.com/apache/arrow/pull/1164/files#diff-
> >> > 0876c9a0005d1dbaea321ea8d39d79ae
> >> >
> >> > So I thought we can probably implement other vectors (remaining
> scalars,
> >> > map and list) where no vector has mutator/accessor and then for every
> >> > ValueVector, we can remove all calls to getMutator(), getAccessor() as
> >> > opposed to doing them selectively ---
> >> > https://github.com/apache/arrow/pull/1164/files#diff-
> >> > e9273a7b3b35ff7f40f101dc2cf95242
> >> >
> >> > I will try to address these failures by EOD and see if this patch can
> be
> >> > merged first.
> >> >
> >> > Regarding readers and writers, can we address them subsequently?
> >> >
> >> > On Fri, Oct 13, 2017 at 11:03 AM, Li Jin 
> wrote:
> >> >
> >> > > Siddharth,
> >> > >
> >> > > Thanks for the update. I think it's fine to move forward with more
> >> > vectors,
> >> > > but in the mean time, I think we should also prioritize to merge
> >> > > https://github.com/apache/arrow/pull/1164, here are a few comments
> >> needs
> >> > > to
> >> > > be addressed.
> >> > >
> >> > > (1) Backward-compatibility:
> >> > > I think there is no way to maintain backward compability as the new
> >> > vector
> >> > > classes will be renamed, but want to confirm we are OK with this
> >> > decision.
> >> > > We also think the disruption on the Spark side are OK as Spark's use
> >> case
> >> > > is simple and Bryan and I can take care of the code change.
> >> > >
> >> > > (2) Reader/writer classes:
> >> > > How does the reader/writer classes interact with the new and legacy
> >> > vector
> >> > > classes:
> >> > >
> >> > > Discussion: https://github.com/apache/arrow/pull/1164#discussion_
> >> > > r144074264
> >> > >
> >> > > My thoughts are:
> >> > > (1) ArrowReader classes should only 

Re: Update on ARROW-1463, related subtasks and plan for testing and merging

2017-10-13 Thread Wes McKinney
It is fine to have not-completely-working states in the refactor
branch. I recommend do whatever is the most expedient thing to help
with making progress.

- Wes

On Fri, Oct 13, 2017 at 5:42 PM, Siddharth Teotia  wrote:
> Li,
>
> I think there is some confusion. Are you suggesting merging into "java
> vector refactor" branch or the master? Is it fine to merge stuff on the
> former branch even though few things are broken (around 10 tests) ? If this
> is allowed, I can do some cleanup (some documentation, some TODOs suggested
> by you and Brian) and we can merge the current patch by EOD or over the
> weekend.
>
> Is this okay? Since we are going to iterate over this branch and not going
> to push anything to master until new code is stable, we are probably good.
>
> Thanks,
> Siddhath
>
>
>
> On Fri, Oct 13, 2017 at 12:17 PM, Li Jin  wrote:
>
>> Siddharth,
>>
>> Regarding rename:
>> Yes this can be done later.
>>
>> Tests:
>> I agree having code like https://github.com/apache/
>> arrow/pull/1164/files#diff-0876c9a0005d1dbaea321ea8d39d79ae is hard to
>> maintain even temporarily. I am not sure what's the best way to resolve
>> test failure wrt removing of the accessor/mutator from the vectors. Maybe
>> we can have change the template the create non-accessor/mutator
>> getter/setters and also remove acessor/mutator in the test for it to pass?
>> What do you think is the easiest?
>>
>> Reader/Writer:
>> Yes we can address this later.
>>
>> Apologies if I seem to add more work for merging https://github.com/
>> apache/arrow/pull/1164, that's not my intention, I think the PR looks
>> good -
>> just want to bring up some major design decisions so people can comment and
>> discuss.
>>
>> Li
>>
>>
>>
>>
>> On Fri, Oct 13, 2017 at 2:37 PM, Siddharth Teotia 
>> wrote:
>>
>> > I am not quite sure of the need to rename the vectors. Why do we need to
>> > rename? This would first require us to remove all the vectors generated
>> by
>> > FixedValueVectors.java as they are non-nullable scalar vectors. Removing
>> > non-nullable vectors is one of the goals, but it can be done once the new
>> > infrastructure is properly setup?
>> >
>> > In order to merge the existing patch, I first need to address some
>> (10-15)
>> > failures -- few of them are correctness issues w.r.t
>> TestVectorUnloadLoad,
>> > TestArrowFile and rest all are related to getMutator(), getAccessor()
>> > throwing UnsupportedOperationException. This is why I was saying earlier
>> > that I will end up doing a lot of rework by writing redundant code where
>> > (if vector instanceof NullableInt or vector instanceof NullableVarChar)
>> we
>> > don't use the mutator/accessor and for other vectors we use it for the
>> > current patch. These if conditions are getting complicated with ugly type
>> > casting in some parts of the code --
>> > https://github.com/apache/arrow/pull/1164/files#diff-
>> > 0876c9a0005d1dbaea321ea8d39d79ae
>> >
>> > So I thought we can probably implement other vectors (remaining scalars,
>> > map and list) where no vector has mutator/accessor and then for every
>> > ValueVector, we can remove all calls to getMutator(), getAccessor() as
>> > opposed to doing them selectively ---
>> > https://github.com/apache/arrow/pull/1164/files#diff-
>> > e9273a7b3b35ff7f40f101dc2cf95242
>> >
>> > I will try to address these failures by EOD and see if this patch can be
>> > merged first.
>> >
>> > Regarding readers and writers, can we address them subsequently?
>> >
>> > On Fri, Oct 13, 2017 at 11:03 AM, Li Jin  wrote:
>> >
>> > > Siddharth,
>> > >
>> > > Thanks for the update. I think it's fine to move forward with more
>> > vectors,
>> > > but in the mean time, I think we should also prioritize to merge
>> > > https://github.com/apache/arrow/pull/1164, here are a few comments
>> needs
>> > > to
>> > > be addressed.
>> > >
>> > > (1) Backward-compatibility:
>> > > I think there is no way to maintain backward compability as the new
>> > vector
>> > > classes will be renamed, but want to confirm we are OK with this
>> > decision.
>> > > We also think the disruption on the Spark side are OK as Spark's use
>> case
>> > > is simple and Bryan and I can take care of the code change.
>> > >
>> > > (2) Reader/writer classes:
>> > > How does the reader/writer classes interact with the new and legacy
>> > vector
>> > > classes:
>> > >
>> > > Discussion: https://github.com/apache/arrow/pull/1164#discussion_
>> > > r144074264
>> > >
>> > > My thoughts are:
>> > > (1) ArrowReader classes should only return new vector classes
>> > > (2) ArrowWriter classes should only work with new vector classes
>> > > (3) To read/write legacy vectors, we can use adapters to turn legacy
>> > > vectors to new vectors (zero-copy, as the underlying buffers should be
>> > > transferred directly)
>> > >
>> > > Jacques also has a few comments, I don't know if they have been
>> > addressed.
>> > >

Re: Update on ARROW-1463, related subtasks and plan for testing and merging

2017-10-13 Thread Siddharth Teotia
Li,

I think there is some confusion. Are you suggesting merging into "java
vector refactor" branch or the master? Is it fine to merge stuff on the
former branch even though few things are broken (around 10 tests) ? If this
is allowed, I can do some cleanup (some documentation, some TODOs suggested
by you and Brian) and we can merge the current patch by EOD or over the
weekend.

Is this okay? Since we are going to iterate over this branch and not going
to push anything to master until new code is stable, we are probably good.

Thanks,
Siddhath



On Fri, Oct 13, 2017 at 12:17 PM, Li Jin  wrote:

> Siddharth,
>
> Regarding rename:
> Yes this can be done later.
>
> Tests:
> I agree having code like https://github.com/apache/
> arrow/pull/1164/files#diff-0876c9a0005d1dbaea321ea8d39d79ae is hard to
> maintain even temporarily. I am not sure what's the best way to resolve
> test failure wrt removing of the accessor/mutator from the vectors. Maybe
> we can have change the template the create non-accessor/mutator
> getter/setters and also remove acessor/mutator in the test for it to pass?
> What do you think is the easiest?
>
> Reader/Writer:
> Yes we can address this later.
>
> Apologies if I seem to add more work for merging https://github.com/
> apache/arrow/pull/1164, that's not my intention, I think the PR looks
> good -
> just want to bring up some major design decisions so people can comment and
> discuss.
>
> Li
>
>
>
>
> On Fri, Oct 13, 2017 at 2:37 PM, Siddharth Teotia 
> wrote:
>
> > I am not quite sure of the need to rename the vectors. Why do we need to
> > rename? This would first require us to remove all the vectors generated
> by
> > FixedValueVectors.java as they are non-nullable scalar vectors. Removing
> > non-nullable vectors is one of the goals, but it can be done once the new
> > infrastructure is properly setup?
> >
> > In order to merge the existing patch, I first need to address some
> (10-15)
> > failures -- few of them are correctness issues w.r.t
> TestVectorUnloadLoad,
> > TestArrowFile and rest all are related to getMutator(), getAccessor()
> > throwing UnsupportedOperationException. This is why I was saying earlier
> > that I will end up doing a lot of rework by writing redundant code where
> > (if vector instanceof NullableInt or vector instanceof NullableVarChar)
> we
> > don't use the mutator/accessor and for other vectors we use it for the
> > current patch. These if conditions are getting complicated with ugly type
> > casting in some parts of the code --
> > https://github.com/apache/arrow/pull/1164/files#diff-
> > 0876c9a0005d1dbaea321ea8d39d79ae
> >
> > So I thought we can probably implement other vectors (remaining scalars,
> > map and list) where no vector has mutator/accessor and then for every
> > ValueVector, we can remove all calls to getMutator(), getAccessor() as
> > opposed to doing them selectively ---
> > https://github.com/apache/arrow/pull/1164/files#diff-
> > e9273a7b3b35ff7f40f101dc2cf95242
> >
> > I will try to address these failures by EOD and see if this patch can be
> > merged first.
> >
> > Regarding readers and writers, can we address them subsequently?
> >
> > On Fri, Oct 13, 2017 at 11:03 AM, Li Jin  wrote:
> >
> > > Siddharth,
> > >
> > > Thanks for the update. I think it's fine to move forward with more
> > vectors,
> > > but in the mean time, I think we should also prioritize to merge
> > > https://github.com/apache/arrow/pull/1164, here are a few comments
> needs
> > > to
> > > be addressed.
> > >
> > > (1) Backward-compatibility:
> > > I think there is no way to maintain backward compability as the new
> > vector
> > > classes will be renamed, but want to confirm we are OK with this
> > decision.
> > > We also think the disruption on the Spark side are OK as Spark's use
> case
> > > is simple and Bryan and I can take care of the code change.
> > >
> > > (2) Reader/writer classes:
> > > How does the reader/writer classes interact with the new and legacy
> > vector
> > > classes:
> > >
> > > Discussion: https://github.com/apache/arrow/pull/1164#discussion_
> > > r144074264
> > >
> > > My thoughts are:
> > > (1) ArrowReader classes should only return new vector classes
> > > (2) ArrowWriter classes should only work with new vector classes
> > > (3) To read/write legacy vectors, we can use adapters to turn legacy
> > > vectors to new vectors (zero-copy, as the underlying buffers should be
> > > transferred directly)
> > >
> > > Jacques also has a few comments, I don't know if they have been
> > addressed.
> > >
> > > For other comments, I think we can add TODO and do it later. I think we
> > can
> > > merge this PR if we address (1) (2) above.
> > >
> > > Comments?
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Fri, Oct 13, 2017 at 12:36 PM, Siddharth Teotia <
> siddha...@dremio.com
> > >
> > > wrote:
> > >
> > > > The patch that I have put up https://github.com/apache/
> 

[jira] [Created] (ARROW-1673) [Python] NumPy boolean arrays get converted to uint8 arrays on NdarrayToTensor roundtrip

2017-10-13 Thread Philipp Moritz (JIRA)
Philipp Moritz created ARROW-1673:
-

 Summary: [Python] NumPy boolean arrays get converted to uint8 
arrays on NdarrayToTensor roundtrip
 Key: ARROW-1673
 URL: https://issues.apache.org/jira/browse/ARROW-1673
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Philipp Moritz


see https://github.com/ray-project/ray/issues/1121



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1672) [Python] Failure to write Feather bytes column

2017-10-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1672:
---

 Summary: [Python] Failure to write Feather bytes column
 Key: ARROW-1672
 URL: https://issues.apache.org/jira/browse/ARROW-1672
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.8.0


See bug report in https://github.com/wesm/feather/issues/320



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1671) [C++] Change arrow::MakeArray to not return Status

2017-10-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-1671:
---

 Summary: [C++] Change arrow::MakeArray to not return Status
 Key: ARROW-1671
 URL: https://issues.apache.org/jira/browse/ARROW-1671
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
 Fix For: 0.8.0


It should not be possible for this function to fail. We can do a DCHECK 
internally of any internal Status values for testing purposes



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: Update on ARROW-1463, related subtasks and plan for testing and merging

2017-10-13 Thread Li Jin
Siddharth,

Regarding rename:
Yes this can be done later.

Tests:
I agree having code like https://github.com/apache/
arrow/pull/1164/files#diff-0876c9a0005d1dbaea321ea8d39d79ae is hard to
maintain even temporarily. I am not sure what's the best way to resolve
test failure wrt removing of the accessor/mutator from the vectors. Maybe
we can have change the template the create non-accessor/mutator
getter/setters and also remove acessor/mutator in the test for it to pass?
What do you think is the easiest?

Reader/Writer:
Yes we can address this later.

Apologies if I seem to add more work for merging https://github.com/
apache/arrow/pull/1164, that's not my intention, I think the PR looks good -
just want to bring up some major design decisions so people can comment and
discuss.

Li




On Fri, Oct 13, 2017 at 2:37 PM, Siddharth Teotia 
wrote:

> I am not quite sure of the need to rename the vectors. Why do we need to
> rename? This would first require us to remove all the vectors generated by
> FixedValueVectors.java as they are non-nullable scalar vectors. Removing
> non-nullable vectors is one of the goals, but it can be done once the new
> infrastructure is properly setup?
>
> In order to merge the existing patch, I first need to address some (10-15)
> failures -- few of them are correctness issues w.r.t TestVectorUnloadLoad,
> TestArrowFile and rest all are related to getMutator(), getAccessor()
> throwing UnsupportedOperationException. This is why I was saying earlier
> that I will end up doing a lot of rework by writing redundant code where
> (if vector instanceof NullableInt or vector instanceof NullableVarChar) we
> don't use the mutator/accessor and for other vectors we use it for the
> current patch. These if conditions are getting complicated with ugly type
> casting in some parts of the code --
> https://github.com/apache/arrow/pull/1164/files#diff-
> 0876c9a0005d1dbaea321ea8d39d79ae
>
> So I thought we can probably implement other vectors (remaining scalars,
> map and list) where no vector has mutator/accessor and then for every
> ValueVector, we can remove all calls to getMutator(), getAccessor() as
> opposed to doing them selectively ---
> https://github.com/apache/arrow/pull/1164/files#diff-
> e9273a7b3b35ff7f40f101dc2cf95242
>
> I will try to address these failures by EOD and see if this patch can be
> merged first.
>
> Regarding readers and writers, can we address them subsequently?
>
> On Fri, Oct 13, 2017 at 11:03 AM, Li Jin  wrote:
>
> > Siddharth,
> >
> > Thanks for the update. I think it's fine to move forward with more
> vectors,
> > but in the mean time, I think we should also prioritize to merge
> > https://github.com/apache/arrow/pull/1164, here are a few comments needs
> > to
> > be addressed.
> >
> > (1) Backward-compatibility:
> > I think there is no way to maintain backward compability as the new
> vector
> > classes will be renamed, but want to confirm we are OK with this
> decision.
> > We also think the disruption on the Spark side are OK as Spark's use case
> > is simple and Bryan and I can take care of the code change.
> >
> > (2) Reader/writer classes:
> > How does the reader/writer classes interact with the new and legacy
> vector
> > classes:
> >
> > Discussion: https://github.com/apache/arrow/pull/1164#discussion_
> > r144074264
> >
> > My thoughts are:
> > (1) ArrowReader classes should only return new vector classes
> > (2) ArrowWriter classes should only work with new vector classes
> > (3) To read/write legacy vectors, we can use adapters to turn legacy
> > vectors to new vectors (zero-copy, as the underlying buffers should be
> > transferred directly)
> >
> > Jacques also has a few comments, I don't know if they have been
> addressed.
> >
> > For other comments, I think we can add TODO and do it later. I think we
> can
> > merge this PR if we address (1) (2) above.
> >
> > Comments?
> >
> >
> >
> >
> >
> >
> > On Fri, Oct 13, 2017 at 12:36 PM, Siddharth Teotia  >
> > wrote:
> >
> > > The patch that I have put up https://github.com/apache/arrow/pull/1198
> > > seems to be in a reasonable state. We are now working off a different
> > > branch "java vector refactor".
> > >
> > > Now that we have the basic structure,  in order to make quick forward
> > > progress, I would like to go ahead and do for other types (FLOAT,
> BIGINT
> > > etc), list, map and create their legacy
> > > counter parts -- doing them in subsequent patches is requiring me to
> > write
> > > some duplicate code and redundant if conditions in code that expects
> all
> > > the vectors to have mutator/accessor.
> > >
> > > Is that fine? Just wanted to check with people and ensure there aren't
> > any
> > > major concerns.
> > >
> > > The feedback on the PR (original one for master
> > > https://github.com/apache/arrow/pull/1164) has been really good --
> some
> > of
> > > the comments are yet to be addressed and we jointly decided 

Re: Update on ARROW-1463, related subtasks and plan for testing and merging

2017-10-13 Thread Siddharth Teotia
I am not quite sure of the need to rename the vectors. Why do we need to
rename? This would first require us to remove all the vectors generated by
FixedValueVectors.java as they are non-nullable scalar vectors. Removing
non-nullable vectors is one of the goals, but it can be done once the new
infrastructure is properly setup?

In order to merge the existing patch, I first need to address some (10-15)
failures -- few of them are correctness issues w.r.t TestVectorUnloadLoad,
TestArrowFile and rest all are related to getMutator(), getAccessor()
throwing UnsupportedOperationException. This is why I was saying earlier
that I will end up doing a lot of rework by writing redundant code where
(if vector instanceof NullableInt or vector instanceof NullableVarChar) we
don't use the mutator/accessor and for other vectors we use it for the
current patch. These if conditions are getting complicated with ugly type
casting in some parts of the code --
https://github.com/apache/arrow/pull/1164/files#diff-0876c9a0005d1dbaea321ea8d39d79ae

So I thought we can probably implement other vectors (remaining scalars,
map and list) where no vector has mutator/accessor and then for every
ValueVector, we can remove all calls to getMutator(), getAccessor() as
opposed to doing them selectively ---
https://github.com/apache/arrow/pull/1164/files#diff-e9273a7b3b35ff7f40f101dc2cf95242

I will try to address these failures by EOD and see if this patch can be
merged first.

Regarding readers and writers, can we address them subsequently?

On Fri, Oct 13, 2017 at 11:03 AM, Li Jin  wrote:

> Siddharth,
>
> Thanks for the update. I think it's fine to move forward with more vectors,
> but in the mean time, I think we should also prioritize to merge
> https://github.com/apache/arrow/pull/1164, here are a few comments needs
> to
> be addressed.
>
> (1) Backward-compatibility:
> I think there is no way to maintain backward compability as the new vector
> classes will be renamed, but want to confirm we are OK with this decision.
> We also think the disruption on the Spark side are OK as Spark's use case
> is simple and Bryan and I can take care of the code change.
>
> (2) Reader/writer classes:
> How does the reader/writer classes interact with the new and legacy vector
> classes:
>
> Discussion: https://github.com/apache/arrow/pull/1164#discussion_
> r144074264
>
> My thoughts are:
> (1) ArrowReader classes should only return new vector classes
> (2) ArrowWriter classes should only work with new vector classes
> (3) To read/write legacy vectors, we can use adapters to turn legacy
> vectors to new vectors (zero-copy, as the underlying buffers should be
> transferred directly)
>
> Jacques also has a few comments, I don't know if they have been addressed.
>
> For other comments, I think we can add TODO and do it later. I think we can
> merge this PR if we address (1) (2) above.
>
> Comments?
>
>
>
>
>
>
> On Fri, Oct 13, 2017 at 12:36 PM, Siddharth Teotia 
> wrote:
>
> > The patch that I have put up https://github.com/apache/arrow/pull/1198
> > seems to be in a reasonable state. We are now working off a different
> > branch "java vector refactor".
> >
> > Now that we have the basic structure,  in order to make quick forward
> > progress, I would like to go ahead and do for other types (FLOAT, BIGINT
> > etc), list, map and create their legacy
> > counter parts -- doing them in subsequent patches is requiring me to
> write
> > some duplicate code and redundant if conditions in code that expects all
> > the vectors to have mutator/accessor.
> >
> > Is that fine? Just wanted to check with people and ensure there aren't
> any
> > major concerns.
> >
> > The feedback on the PR (original one for master
> > https://github.com/apache/arrow/pull/1164) has been really good -- some
> of
> > the comments are yet to be addressed and we jointly decided to address
> few
> > things (like Minor Type etc) after the refactoring has been done.
> >
> > On the testing front, as far as the correctness is concerned, I have two
> > failures in TestArrowFile and TestValueVector. I have added some more
> tests
> > too.
> >
> >
> >
> >
> >
> > On Thu, Oct 12, 2017 at 2:18 PM, Siddharth Teotia 
> > wrote:
> >
> > > Yes, that is the intention. Good that we all are on the same page. I
> will
> > > move the PR https://github.com/apache/arrow/pull/1164 to new branch.
> > >
> > > On Thu, Oct 12, 2017 at 11:20 AM, Li Jin 
> wrote:
> > >
> > >> To make clear, I think it's fine to have Legacy Vectors in 0.8 as a
> > >> deprecated API.
> > >>
> > >> On Thu, Oct 12, 2017 at 2:19 PM, Li Jin 
> wrote:
> > >>
> > >> > Siddharth,
> > >> >
> > >> > For working off a branch, Wes has created
> https://github.com/apache/
> > >> > arrow/tree/java-vector-refactor that we can submit PR to.
> > >> >
> > >> > For Legacy vectors, I think it's fine because it's really just a
> > >> 

Re: Update on ARROW-1463, related subtasks and plan for testing and merging

2017-10-13 Thread Li Jin
Siddharth,

Thanks for the update. I think it's fine to move forward with more vectors,
but in the mean time, I think we should also prioritize to merge
https://github.com/apache/arrow/pull/1164, here are a few comments needs to
be addressed.

(1) Backward-compatibility:
I think there is no way to maintain backward compability as the new vector
classes will be renamed, but want to confirm we are OK with this decision.
We also think the disruption on the Spark side are OK as Spark's use case
is simple and Bryan and I can take care of the code change.

(2) Reader/writer classes:
How does the reader/writer classes interact with the new and legacy vector
classes:

Discussion: https://github.com/apache/arrow/pull/1164#discussion_r144074264

My thoughts are:
(1) ArrowReader classes should only return new vector classes
(2) ArrowWriter classes should only work with new vector classes
(3) To read/write legacy vectors, we can use adapters to turn legacy
vectors to new vectors (zero-copy, as the underlying buffers should be
transferred directly)

Jacques also has a few comments, I don't know if they have been addressed.

For other comments, I think we can add TODO and do it later. I think we can
merge this PR if we address (1) (2) above.

Comments?






On Fri, Oct 13, 2017 at 12:36 PM, Siddharth Teotia 
wrote:

> The patch that I have put up https://github.com/apache/arrow/pull/1198
> seems to be in a reasonable state. We are now working off a different
> branch "java vector refactor".
>
> Now that we have the basic structure,  in order to make quick forward
> progress, I would like to go ahead and do for other types (FLOAT, BIGINT
> etc), list, map and create their legacy
> counter parts -- doing them in subsequent patches is requiring me to write
> some duplicate code and redundant if conditions in code that expects all
> the vectors to have mutator/accessor.
>
> Is that fine? Just wanted to check with people and ensure there aren't any
> major concerns.
>
> The feedback on the PR (original one for master
> https://github.com/apache/arrow/pull/1164) has been really good -- some of
> the comments are yet to be addressed and we jointly decided to address few
> things (like Minor Type etc) after the refactoring has been done.
>
> On the testing front, as far as the correctness is concerned, I have two
> failures in TestArrowFile and TestValueVector. I have added some more tests
> too.
>
>
>
>
>
> On Thu, Oct 12, 2017 at 2:18 PM, Siddharth Teotia 
> wrote:
>
> > Yes, that is the intention. Good that we all are on the same page. I will
> > move the PR https://github.com/apache/arrow/pull/1164 to new branch.
> >
> > On Thu, Oct 12, 2017 at 11:20 AM, Li Jin  wrote:
> >
> >> To make clear, I think it's fine to have Legacy Vectors in 0.8 as a
> >> deprecated API.
> >>
> >> On Thu, Oct 12, 2017 at 2:19 PM, Li Jin  wrote:
> >>
> >> > Siddharth,
> >> >
> >> > For working off a branch, Wes has created https://github.com/apache/
> >> > arrow/tree/java-vector-refactor that we can submit PR to.
> >> >
> >> > For Legacy vectors, I think it's fine because it's really just a
> >> migration
> >> > path to help Dremio to migrate to the new vectors. I don't think other
> >> > users, i.e., Spark will use the Legacy vector class. Bryan and I will
> >> just
> >> > migrate Spark to new vectors directly because Spark's use of Arrow is
> >> very
> >> > simple.
> >> >
> >> >
> >> >
> >> > On Thu, Oct 12, 2017 at 2:08 PM, Siddharth Teotia <
> siddha...@dremio.com
> >> >
> >> > wrote:
> >> >
> >> >> Thanks Bryan and Li.
> >> >>
> >> >> Yes, the goal is to get this (and the subsequent patches) merged to
> the
> >> >> new
> >> >> branch. Once it is stabilized from different aspects, we can move to
> >> >> master. I am not sure of the exact mechanics when we work off a
> >> different
> >> >> project branch and not master.
> >> >>
> >> >> Does that sound good?
> >> >>
> >> >> Regarding compatibility, are we suggesting that let's not create
> Legacy
> >> >> Nullable vectors at all? The initial thoughts were to generate Legacy
> >> >> vectors from NullableValueVectors template and these vectors are
> >> >> mutator/accessor based (in today's world). Internally each operation
> >> will
> >> >> be delegated to new vectors (non code generated).
> >> >>
> >> >> On Thu, Oct 12, 2017 at 10:58 AM, Bryan Cutler 
> >> wrote:
> >> >>
> >> >> > Thanks for the update Siddharth.  From the Spark side of this, I
> >> >> definitely
> >> >> > want to try to upgrade to the latest Arrow before the Spark 2.3
> >> release
> >> >> but
> >> >> > if it the refactor is too disruptive then others might get
> squeamish
> >> >> about
> >> >> > upgrading.  On the other hand, I don't think we should hold back on
> >> >> > refactoring for compatibility sake and the way it's looking now
> >> trying
> >> >> to
> >> >> > be backwards-compatible will be too much of a pain.  I 

Re: Update on ARROW-1463, related subtasks and plan for testing and merging

2017-10-13 Thread Siddharth Teotia
The patch that I have put up https://github.com/apache/arrow/pull/1198
seems to be in a reasonable state. We are now working off a different
branch "java vector refactor".

Now that we have the basic structure,  in order to make quick forward
progress, I would like to go ahead and do for other types (FLOAT, BIGINT
etc), list, map and create their legacy
counter parts -- doing them in subsequent patches is requiring me to write
some duplicate code and redundant if conditions in code that expects all
the vectors to have mutator/accessor.

Is that fine? Just wanted to check with people and ensure there aren't any
major concerns.

The feedback on the PR (original one for master
https://github.com/apache/arrow/pull/1164) has been really good -- some of
the comments are yet to be addressed and we jointly decided to address few
things (like Minor Type etc) after the refactoring has been done.

On the testing front, as far as the correctness is concerned, I have two
failures in TestArrowFile and TestValueVector. I have added some more tests
too.





On Thu, Oct 12, 2017 at 2:18 PM, Siddharth Teotia 
wrote:

> Yes, that is the intention. Good that we all are on the same page. I will
> move the PR https://github.com/apache/arrow/pull/1164 to new branch.
>
> On Thu, Oct 12, 2017 at 11:20 AM, Li Jin  wrote:
>
>> To make clear, I think it's fine to have Legacy Vectors in 0.8 as a
>> deprecated API.
>>
>> On Thu, Oct 12, 2017 at 2:19 PM, Li Jin  wrote:
>>
>> > Siddharth,
>> >
>> > For working off a branch, Wes has created https://github.com/apache/
>> > arrow/tree/java-vector-refactor that we can submit PR to.
>> >
>> > For Legacy vectors, I think it's fine because it's really just a
>> migration
>> > path to help Dremio to migrate to the new vectors. I don't think other
>> > users, i.e., Spark will use the Legacy vector class. Bryan and I will
>> just
>> > migrate Spark to new vectors directly because Spark's use of Arrow is
>> very
>> > simple.
>> >
>> >
>> >
>> > On Thu, Oct 12, 2017 at 2:08 PM, Siddharth Teotia > >
>> > wrote:
>> >
>> >> Thanks Bryan and Li.
>> >>
>> >> Yes, the goal is to get this (and the subsequent patches) merged to the
>> >> new
>> >> branch. Once it is stabilized from different aspects, we can move to
>> >> master. I am not sure of the exact mechanics when we work off a
>> different
>> >> project branch and not master.
>> >>
>> >> Does that sound good?
>> >>
>> >> Regarding compatibility, are we suggesting that let's not create Legacy
>> >> Nullable vectors at all? The initial thoughts were to generate Legacy
>> >> vectors from NullableValueVectors template and these vectors are
>> >> mutator/accessor based (in today's world). Internally each operation
>> will
>> >> be delegated to new vectors (non code generated).
>> >>
>> >> On Thu, Oct 12, 2017 at 10:58 AM, Bryan Cutler 
>> wrote:
>> >>
>> >> > Thanks for the update Siddharth.  From the Spark side of this, I
>> >> definitely
>> >> > want to try to upgrade to the latest Arrow before the Spark 2.3
>> release
>> >> but
>> >> > if it the refactor is too disruptive then others might get squeamish
>> >> about
>> >> > upgrading.  On the other hand, I don't think we should hold back on
>> >> > refactoring for compatibility sake and the way it's looking now
>> trying
>> >> to
>> >> > be backwards-compatible will be too much of a pain.  I will try to
>> >> figure
>> >> > out the timeline for Spark 2.3 and what the feeling is for upgrading
>> >> > Arrow.  Can we hold off on merging this to master for now and just
>> work
>> >> out
>> >> > of the separate branch until we can get a better feeling for the
>> impact?
>> >> >
>> >> > Bryan
>> >> >
>> >> > On Wed, Oct 11, 2017 at 8:19 AM, Li Jin 
>> wrote:
>> >> >
>> >> > > Hi Siddharth,
>> >> > >
>> >> > > Thanks for the update. This looks good.
>> >> > >
>> >> > > A few thoughts:
>> >> > >
>> >> > > *Compatibility:*
>> >> > > It sounds like we will introduce some back-compatibility with the
>> new
>> >> > > Vector class. At this point I think our main Java users should be
>> >> Spark
>> >> > and
>> >> > > Dremio, is this right?
>> >> > >
>> >> > >
>> >> > >- For Spark:
>> >> > >
>> >> > > It seems fine since Spark uses just the basic functionality of
>> Vector
>> >> > > classes and the existing code should work with the new Vector
>> classes,
>> >> > > maybe even without any code change on the Spark side.
>> >> > >
>> >> > >
>> >> > >- For Dremio:
>> >> > >
>> >> > > Sounds like you are already taking care of this by introducing the
>> >> > > LegacyVector classes.
>> >> > >
>> >> > >
>> >> > > *Testing:*
>> >> > >
>> >> > >- Spark Integration Tests:
>> >> > >
>> >> > > Bryan and I can help with integration test with Spark. I think the
>> >> target
>> >> > > timeline for Spark 2.3 release is some time in mid Nov (Bryan
>> please
>> >> > > correct me if I am