Re: Add support for Decimal64

2021-11-23 Thread siddharth teotia
If the timeline is not tight, I can help with Java side implementation.
IIRC, we already have 16 byte and 32 byte 2's complement based decimal
vector implementations in Java based off BigDecimal.

Is this similar to the work needed for 4 and 8 byte implementations? I will
have to refresh my memory of the code, but I can help do it over a period of
time from January onwards.

In any case, I will definitely have bandwidth to help review the work if
someone else wants to do it sooner.
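For context, here is a minimal sketch (an editorial illustration, not code from this
thread) of how the existing 16-byte DecimalVector is driven from Java via BigDecimal;
the precision/scale values and the class name DecimalVectorSketch are made up for the
example, and a 4- or 8-byte variant would presumably expose a similar API:

import java.math.BigDecimal;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.DecimalVector;

// Sketch of the existing 128-bit (16 byte) decimal vector; a Decimal64
// variant would presumably look similar but back each value with 8 bytes.
public class DecimalVectorSketch {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         // precision = 10, scale = 2 (assumed values for illustration)
         DecimalVector vector = new DecimalVector("amount", allocator, 10, 2)) {
      vector.allocateNew(2);
      vector.setSafe(0, new BigDecimal("123.45"));  // stored as 2's complement
      vector.setNull(1);                            // validity bit cleared
      vector.setValueCount(2);
      System.out.println(vector.getObject(0));      // 123.45 as BigDecimal
      System.out.println(vector.isNull(1));         // true
    }
  }
}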


On Tue, Nov 23, 2021, 7:52 PM Micah Kornfield  wrote:

> It would be nice to round out support.  I think there might be a stale PR
> that was started on the C++ side.
>
> Unfortunately, I do not have bandwidth to help with the effort.
>
> On Tuesday, November 23, 2021, Wang Xudong  wrote:
>
> > Yes, it would be very nice to add 32-bit and 64-bit decimal support to Arrow. If
> > we decide to do it, I think I can help to work on C++ support.
> >
> > --
> > xudong963
> >
> > Wes McKinney  于2021年11月24日周三 上午11:28写道:
> >
> > > I think we should consider adding 32-bit and 64-bit decimal support to
> > > Arrow — this also needs to be added at the specification level and we
> > > would need volunteers to work on Java and C++ support as well as
> > > integration testing. What do others think?
> > >
> > > On Thu, Nov 18, 2021 at 11:13 PM Stephen Jiang <
> syuanjiang...@gmail.com>
> > > wrote:
> > > >
> > > > We are heavy users of Arrow and need support for Decimal64. Currently
> > > > we have to use Decimal128 for small decimals.  Any update on when
> > > > ARROW-9404 will be worked on?
> > > >
> > > > Thanks
> > > > Stephen
> > >
> >
>


Re: [ANNOUNCE] New Arrow PMC chair: Wes McKinney

2020-10-23 Thread siddharth teotia
Congratulations, Wes

On Fri, Oct 23, 2020, 4:40 PM Neal Richardson 
wrote:

> Congratulations, Wes!
>
> On Fri, Oct 23, 2020 at 4:35 PM Jacques Nadeau  wrote:
>
> > I am pleased to announce that we have a new PMC chair and VP as per our
> > newly started tradition of rotating the chair once a year. I have
> resigned
> > and Wes was duly elected by the PMC and approved unanimously by the
> board.
> >
> > Please join me in congratulating Wes!
> >
> > Jacques
> >
>


Re: Help with Java PR backlog

2020-06-12 Thread siddharth teotia
I can take a look as well.

On Thu, Jun 11, 2020, 7:18 PM Fan Liya  wrote:

> I would like to help with the review.
> I will spend some time on it late today.
>
> Best,
> Liya Fan
>
>
> On Fri, Jun 12, 2020 at 9:56 AM Wes McKinney  wrote:
>
> > hi folks,
> >
> > There's a number of Java PRs that seem like they are close to being in
> > a merge-ready state, could we try to get the Java backlog mostly
> > closed out before the next release (in a few weeks)?
> >
> > Thanks
> > Wes
> >
>


Re: [ANNOUNCE] New Arrow committers: Ji Liu and Liya Fan

2020-06-11 Thread siddharth teotia
Congratulations!

On Thu, Jun 11, 2020 at 7:51 AM Neal Richardson 
wrote:

> Congratulations, both!
>
> Neal
>
> On Thu, Jun 11, 2020 at 7:38 AM Wes McKinney  wrote:
>
> > On behalf of the Arrow PMC I'm happy to announce that Ji Liu and Liya
> > Fan have been invited to be Arrow committers and they have both
> > accepted.
> >
> > Welcome, and thank you for your contributions!
> >
>


-- 
*Best Regards,*
*SIDDHARTH TEOTIA*
*2008C6PS540G*
*BITS PILANI- GOA CAMPUS*

*+91 87911 75932*


Re: [Java] PR Reviewers

2020-02-02 Thread siddharth teotia
Sure thing.

I reviewed only a few PRs in the last six months due to a job transition; I was
mostly merging the approved ones. I have recently resumed keeping tabs on PRs.

Thanks
Sidd
On Sun, Feb 2, 2020, 9:35 PM Micah Kornfield  wrote:

> Thanks Sidd!
>
> Feel free to jump on any of the more recent Java PRs (I think there are a
> few directly dealing with separating ArrowBuf from Netty, which I believe
> builds off work you contributed in the past.  Those might be a good place
> to start).
>
> On Sun, Feb 2, 2020 at 9:20 PM siddharth teotia 
> wrote:
>
>> Hi All,
>>
>> I can help review Java PRs.
>>
>> Thanks
>> Sidd
>>
>>
>> On Sun, Feb 2, 2020, 8:37 PM Micah Kornfield 
>> wrote:
>>
>>> OK, I think I've triaged the open Java PRs.  Lets see how it goes.
>>>
>>> On Mon, Jan 27, 2020 at 11:13 PM Micah Kornfield 
>>> wrote:
>>>
>>> > Somewhat related, but are there any thoughts about growing the Java
>>> >> developer community generally? Perhaps we could do some outreach to
>>> >> other Java-focused Apache communities (Iceberg comes to mind, but
>>> >> there may be others)?
>>> >
>>> > I'm all for this.  I think one of the things that we are lacking a
>>> little
>>> > bit on the Java side of things is a clear idea of what we want to build
>>> > into Apache Arrow proper.  For instance, in the past, I've been -0.5
>>> > on trying to replicate the work that is on-going on the C++ side of
>>> things,
>>> > but maybe we should reconsider that? Or at least more JNI bindings?
>>> > Getting more input on this would be useful especially from those
>>> outside
>>> > the community.  I still think a strong set of adapter libraries,
>>> especially
>>> > if we can make them "best of class" in performance would be beneficial
>>> for
>>> > adoption.
>>> >
>>> > Not directly related, but it would be nice if Java contributors could
>>> >> fill the holes in the 0.16.0 release blog post.  Currently the Java
>>> >> section is empty:
>>> >> https://github.com/apache/arrow-site/pull/41
>>> >
>>> >
>>> > I put a few bullet points in.
>>> >
>>> > On Mon, Jan 27, 2020 at 11:08 AM Antoine Pitrou 
>>> > wrote:
>>> >
>>> >>
>>> >> Not directly related, but it would be nice if Java contributors could
>>> >> fill the holes in the 0.16.0 release blog post.  Currently the Java
>>> >> section is empty:
>>> >> https://github.com/apache/arrow-site/pull/41
>>> >>
>>> >> Regards
>>> >>
>>> >> Antoine.
>>> >>
>>> >>
>>> >> Le 27/01/2020 à 19:40, Ryan Murray a écrit :
>>> >> > Hey all, I would love to help out. Is there any specific ones that
>>> are
>>> >> > relatively easy for me to get started on?
>>> >> >
>>> >>
>>> >
>>>
>>


Re: [Java] PR Reviewers

2020-02-02 Thread siddharth teotia
Hi All,

I can help review Java PRs.

Thanks
Sidd


On Sun, Feb 2, 2020, 8:37 PM Micah Kornfield  wrote:

> OK, I think I've triaged the open Java PRs.  Lets see how it goes.
>
> On Mon, Jan 27, 2020 at 11:13 PM Micah Kornfield 
> wrote:
>
> > Somewhat related, but are there any thoughts about growing the Java
> >> developer community generally? Perhaps we could do some outreach to
> >> other Java-focused Apache communities (Iceberg comes to mind, but
> >> there may be others)?
> >
> > I'm all for this.  I think one of the things that we are lacking a little
> > bit on the Java side of things is a clear idea of what we want to build
> > into Apache Arrow proper.  For instance, in the past, I've been -0.5
> > on trying to replicate the work that is on-going on the C++ side of
> things,
> > but maybe we should reconsider that? Or at least more JNI bindings?
> > Getting more input on this would be useful especially from those outside
> > the community.  I still think a strong set of adapter libraries,
> especially
> > if we can make them "best of class" in performance would be beneficial
> for
> > adoption.
> >
> > Not directly related, but it would be nice if Java contributors could
> >> fill the holes in the 0.16.0 release blog post.  Currently the Java
> >> section is empty:
> >> https://github.com/apache/arrow-site/pull/41
> >
> >
> > I put a few bullet points in.
> >
> > On Mon, Jan 27, 2020 at 11:08 AM Antoine Pitrou 
> > wrote:
> >
> >>
> >> Not directly related, but it would be nice if Java contributors could
> >> fill the holes in the 0.16.0 release blog post.  Currently the Java
> >> section is empty:
> >> https://github.com/apache/arrow-site/pull/41
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> Le 27/01/2020 à 19:40, Ryan Murray a écrit :
> >> > Hey all, I would love to help out. Is there any specific ones that are
> >> > relatively easy for me to get started on?
> >> >
> >>
> >
>


Re: ARROW-3191: Making ArrowBuf work with arbitrary memory and setting io.netty.tryReflectionSetAccessible to true for java builds

2019-05-06 Thread Siddharth Teotia
Hi Bryan,

AFAIK, there is no other impact. So we should be good.

The last few integration issues that I had been chasing are now fixed (I got
a clean build with my previous commit pushed over the weekend). I just
pushed a new commit with some cleanup, and the changes are now ready. We
should plan to merge this ASAP this week.

Thanks,
Siddharth

On Fri, May 3, 2019 at 10:21 AM Bryan Cutler  wrote:

> Hi Sidd,
>
> Does setting the system property io.netty.tryReflectionSetAccessible to
> true have any other adverse effect other than those warnings during build?
>
> Bryan
>
> On Thu, May 2, 2019 at 8:43 PM Jacques Nadeau  wrote:
>
> > I'm onboard with this change.
> >
> > On Fri, Apr 26, 2019 at 2:14 AM Siddharth Teotia 
> > wrote:
> >
> > > As part of working on this patch
> > > <https://github.com/apache/arrow/pull/4151>,
> > > I ran into a problem with JDK 9 and 11 builds.  Since the memory underlying
> > > ArrowBuf may not necessarily be a ByteBuf (or any of its extensions),
> > > methods like nioBuffer() can no longer be delegated as
> > > UnsafeDirectLittleEndian.nioBuffer() to the Netty implementation.
> > >
> > > So I used PlatformDependent.directBuffer(memory address, size) to create a
> > > direct ByteBuffer to closely mimic what Netty was originally doing
> > > underneath for nioBuffer(). It turns out that the PlatformDependent code in
> > > Netty first checks for the existence of the constructor DirectByteBuffer(long
> > > address, int size), as seen here
> > > <https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent0.java#L223>.
> > > The constructor (long address, int size) is available in JDK 8, 9
> > > and 11, but on the next line Netty tries to make it accessible. The reflection
> > > based access is disabled by default in Netty code for JDK >= 9, as seen here
> > > <https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent0.java#L829>.
> > > Thus calls to PlatformDependent.directBuffer(address, size) were failing in
> > > Travis CI builds for JDK 9 and 11 with UnsupportedOperationException, as
> > > seen here
> > > <https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent.java#L415>,
> > > and this was because of the decision taken by Netty at startup w.r.t.
> > > whether to provide access to the constructor or not.
> > >
> > > We should set the io.netty.tryReflectionSetAccessible system property to
> > > true in the Java root pom.
> > >
> > > I want to make sure people are aware and agree/disagree with this change.
> > >
> > > The tests now give the following warning:
> > >
> > > WARNING: An illegal reflective access operation has occurred
> > > WARNING: Illegal reflective access by io.netty.util.internal.ReflectionUtil
> > > (file:/Users/siddharthteotia/.m2/repository/io/netty/netty-common/4.1.22.Final/netty-common-4.1.22.Final.jar)
> > > to constructor java.nio.DirectByteBuffer(long,int)
> > > WARNING: Please consider reporting this to the maintainers of
> > > io.netty.util.internal.ReflectionUtil
> > > WARNING: Use --illegal-access=warn to enable warnings of further illegal
> > > reflective access operations
> > > WARNING: All illegal access operations will be denied in a future release
> > >
> > > Thanks.
> > > On Thu, Apr 18, 2019 at 3:39 PM Siddharth Teotia  >
> > > wrote:
> > >
> > > > I  have made all the necessary changes in java code to work with new
> > > > ArrowBuf, ReferenceManager interfaces. More importantly, there is a
> > > wrapper
> > > > buffer NettyArrowBuf interface to comply with usage in RPC and Netty
> > > > related code. It will be good to get feedback on this one (and of
> > course
> > > > all other changes).  As of now, the java modules build fine but I
> have
> > to
> > > > fix test failures. That is in progress.
> > > >
> > > > On Wed, Apr 17, 2019 at 6:41 AM Jacques Nadeau 
> > > wrote:
> > > >
> > > >> Are there any other general comments here? If not, let's get this
> done
> > > and
> > > >> merged

Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow

2019-05-01 Thread Siddharth Teotia
Looks like there are 2 PRs for this work:
https://github.com/apache/arrow/pull/4186 adds new getUnsafe-type
APIs to ArrowBuf that don't do checkIndex() before calling
PlatformDependent.get(memory address). So the access will go through
vector.get() -> buffer.get() -> PlatformDependent.get() -> UNSAFE.get, which
is what we do today but without any bounds checking.

I believe the proposal suggested here and the WIP PR --
https://github.com/apache/arrow/pull/4212 -- add new versions of vectors
where the call to vector.get() bypasses ArrowBuf and directly
invokes PlatformDependent with the absolute address at which we want to
read/write. Correct? It looks like the call to ArrowBuf is still needed to get
the starting address of the buffer before computing the absolute address.

I am wondering whether much of the overhead is coming from the conditions and
branches inside bounds checking or just from the chain of method calls. If it is
bounds checking, then the first PR would probably suffice.
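To make the distinction concrete, here is a hedged, self-contained toy (not
Arrow's actual ArrowBuf/ValueVector classes) showing where the bounds and
validity checks sit in a checked get() versus an unchecked getUnsafe():

// Toy fixed-width "vector" illustrating checked vs. unchecked reads; a sketch
// for discussion only, not Arrow's implementation.
public class ToyIntVector {
  private final int[] data;        // stands in for the off-heap data buffer
  private final byte[] validity;   // 1 bit per value, like Arrow's validity buffer

  public ToyIntVector(int valueCount) {
    this.data = new int[valueCount];
    this.validity = new byte[(valueCount + 7) / 8];
  }

  // Checked path: bounds check + null check before touching memory.
  public int get(int index) {
    if (index < 0 || index >= data.length) {
      throw new IndexOutOfBoundsException("index " + index);
    }
    if ((validity[index >> 3] & (1 << (index & 7))) == 0) {
      throw new IllegalStateException("value at index " + index + " is null");
    }
    return data[index];
  }

  // Unchecked path: the caller guarantees the index is in range and the value
  // is non-null, mirroring the getUnsafe-style APIs discussed above.
  public int getUnsafe(int index) {
    return data[index];
  }

  public void set(int index, int value) {
    validity[index >> 3] |= (1 << (index & 7));
    data[index] = value;
  }
}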

On Tue, Apr 30, 2019 at 9:46 AM Parth Chandra  wrote:

> FWIW, in Drill's Value Vector code, we found that bounds checking was a
> major performance bottleneck in operators that wrote to vectors. Scans, as
> a result, were particularly affected. Another bottleneck was the zeroing of
> vectors.
> There were many unnecessary bounds checks. For example in a varchar vector,
> there is one check while writing the data, one while writing the validity
> bit, one more in the buffer allocator for the data buffer, one more in the
> buffer allocator for the validity bit buffer, one more each in the
> underlying ByteBuf implementation. It gets worse with repeated/array types.
> Some code paths in Drill were optimized to get rid of these bounds checks
> (eventually I suppose all of them will be updated). The approach was to
> bypass the ValueVector API and write directly to the Drill(/Arrow)Buf.
> Writing to the memory address directly, as is being proposed by Liya Fan,
> was initially tried but did not have any measurable performance
> improvements. BTW, writing to the memory address would also conflict with
> ARROW-3191.
> Note that the performance tests were for Drill queries, not Vectors, so
> writing to memory directly may still have a noticeable performance benefit
> for different use cases.
> Sorry, I don't have actual numbers with me to share and I'm not sure how
> much Arrow has diverged from the original Drill implementation, but the
> Drill experience would suggest that this proposal certainly has merit.
>
> Parth
>
> On Mon, Apr 29, 2019 at 11:18 AM Wes McKinney  wrote:
>
> > I'm also curious which APIs are particularly problematic for
> > performance. In ARROW-1833 [1] and some related discussions there was
> > the suggestion of adding methods like getUnsafe, so this would be like
> > get(i) [2] but without checking the validity bitmap
> >
> > [1] : https://issues.apache.org/jira/browse/ARROW-1833
> > [2]:
> >
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/Float8Vector.java#L99
> >
> > On Mon, Apr 29, 2019 at 1:05 PM Micah Kornfield 
> > wrote:
> > >
> > > Thanks for the design.  Personally, I'm not a huge fan of creating
> > > parallel classes for every vector type; this ends up being confusing for
> > > developers and adds a lot of boilerplate.  I wonder if you could use a
> > > similar approach that the memory module uses for turning bounds
> checking
> > > on/off [1].
> > >
> > > Also, I think there was a comment on the JIRA, but are there any
> > benchmarks
> > > to show the expected improvements?  My limited understanding is that
> for
> > > small methods the JVM's JIT should inline them anyways [2] , so it is
> not
> > > clear how much this will improve performance.
> > >
> > >
> > > Thanks,
> > > Micah
> > >
> > >
> > > [1]
> > >
> >
> https://github.com/apache/arrow/blob/master/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java
> > > [2]
> > >
> >
> https://stackoverflow.com/questions/24923040/do-modern-java-compilers-jvm-inline-functions-methods-which-are-called-exactly-f
> > >
> > > On Sun, Apr 28, 2019 at 2:50 AM Fan Liya  wrote:
> > >
> > > > Hi all,
> > > >
> > > > We are proposing a new set of APIs in Arrow - unsafe vector APIs. The
> > > > general ideas is attached below, and also accessible from our online
> > > > document
> > > > <
> >
> https://docs.google.com/document/d/13oZFVS1EnNedZd_7udx-h10G2tRTjfgHe2ngp2ZWJ70/edit?usp=sharing
> > >.
> > > > Please give your valuable comments by directly commenting in our
> online
> > > > document
> > > > <
> >
> https://docs.google.com/document/d/13oZFVS1EnNedZd_7udx-h10G2tRTjfgHe2ngp2ZWJ70/edit?usp=sharing
> > >,
> > > > or relaying this email thread.
> > > >
> > > > Thank you so much in advance.
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > > Support Fast/Unsafe Vector APIs for Arrow Background
> > > >
> > > > In our effort to support columnar data format in Apache Flink, we
> 

Re: ARROW-3191: Status update: Making ArrowBuf work with arbitrary memory

2019-05-01 Thread Siddharth Teotia
Quick status update: I have 2 outstanding integration test failures
<https://travis-ci.org/apache/arrow/builds/524723636?utm_source=github_status_medium=notification>
that need to be addressed -- I was out for a couple of days and then got dragged
into another issue. I am looking into the failures now. I hope people have
looked at my previous email for the change I had made to get the JDK >= 9
builds passing.

On Thu, Apr 25, 2019 at 3:13 PM Siddharth Teotia 
wrote:

> As part of working on this patch
> <https://github.com/apache/arrow/pull/4151>, I ran into a problem with
> jdk 9 and 11 builds.  Since memory underlying ArrowBuf may not necessarily
> be a ByteBuf (or any of its extensions), methods like nioBuffer() can no
> longer be delegated as UnsafeDirectLittleEndian.nioBuffer() to Netty
> implementation.
>
> So I used PlatformDependent.directBuffer(memory address, size) to create a
> direct Byte Buffer  to closely mimic what Netty was originally doing
> underneath for nioBuffer(). It turns out that PlatformDependent code in
> netty first checks for the existence of constructor DirectByteBuffer(long
> address, int size) as seen here
> <https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent0.java#L223>.
> The constructor (long address, int size) is very well available in jdk 8, 9
> and 11 but on the next line it tries to set it accessible. The reflection
> based access is disabled by default in netty code for jdk >= 9 as seen
> here
> <https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent0.java#L829>.
> Thus calls to PlatformDependent.directBuffer(address, size) were failing in
> travis CI builds for JDK 9 and 11 with UnsupportedOperationException as
> seen here
> <https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent.java#L415>
>  and
> this was because of the decision that was taken by netty at startup w.r.t
> whether to provide access to constructor or not.
>
> We should set io.netty.tryReflectionSetAccessible system property to true
> in java root pom
>
> I want to make sure people are aware and agree/disagree with this change.
>
> The tests now give the following warning:
>
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by
> io.netty.util.internal.ReflectionUtil
> (file:/Users/siddharthteotia/.m2/repository/io/netty/netty-common/4.1.22.Final/netty-common-4.1.22.Final.jar)
> to constructor java.nio.DirectByteBuffer(long,int)
> WARNING: Please consider reporting this to the maintainers of
> io.netty.util.internal.ReflectionUtil
> WARNING: Use --illegal-access=warn to enable warnings of further illegal
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
>
> Thanks.
> On Thu, Apr 18, 2019 at 3:39 PM Siddharth Teotia 
> wrote:
>
>> I  have made all the necessary changes in java code to work with new
>> ArrowBuf, ReferenceManager interfaces. More importantly, there is a wrapper
>> buffer NettyArrowBuf interface to comply with usage in RPC and Netty
>> related code. It will be good to get feedback on this one (and of course
>> all other changes).  As of now, the java modules build fine but I have to
>> fix test failures. That is in progress.
>>
>> On Wed, Apr 17, 2019 at 6:41 AM Jacques Nadeau 
>> wrote:
>>
>>> Are there any other general comments here? If not, let's get this done
>>> and
>>> merged.
>>>
>>> On Mon, Apr 15, 2019, 4:19 PM Siddharth Teotia 
>>> wrote:
>>>
>>> > I believe reader/writer indexes are typically used when we send buffers
>>> > over the wire -- so may not be necessary for all users of ArrowBuf.  I
>>> am
>>> > okay with the idea of providing a simple wrapper to ArrowBuf to manage
>>> the
>>> > reader/writer indexes with a couple of APIs. Note that some APIs like
>>> > writeInt, writeLong() bump the writer index unlike setInt/setLong
>>> > counterparts. JsonFileReader uses some of these APIs.
>>> >
>>> >
>>> >
>>> > On Sat, Apr 13, 2019 at 2:42 PM Jacques Nadeau 
>>> wrote:
>>> >
>>> > > Hey Sidd,
>>> > >
>>> > > Thanks for pulling this together. This looks very promising. One
>>> quick
>>> > > thought: do we think the concept of the reader and writer index need
>>> to
>>> > be
>>> > > on ArrowBuf? It seems like something that could be added as an
&

Re: ARROW-3191: Making ArrowBuf work with arbitrary memory and setting io.netty.tryReflectionSetAccessible to true for java builds

2019-04-25 Thread Siddharth Teotia
As part of working on this patch <https://github.com/apache/arrow/pull/4151>,
I ran into a problem with JDK 9 and 11 builds.  Since the memory underlying
ArrowBuf may not necessarily be a ByteBuf (or any of its extensions),
methods like nioBuffer() can no longer be delegated as
UnsafeDirectLittleEndian.nioBuffer() to the Netty implementation.

So I used PlatformDependent.directBuffer(memory address, size) to create a
direct ByteBuffer to closely mimic what Netty was originally doing
underneath for nioBuffer(). It turns out that the PlatformDependent code in
Netty first checks for the existence of the constructor DirectByteBuffer(long
address, int size), as seen here
<https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent0.java#L223>.
The constructor (long address, int size) is available in JDK 8, 9
and 11, but on the next line Netty tries to make it accessible. The reflection
based access is disabled by default in Netty code for JDK >= 9, as seen here
<https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent0.java#L829>.
Thus calls to PlatformDependent.directBuffer(address, size) were failing in
Travis CI builds for JDK 9 and 11 with UnsupportedOperationException, as
seen here
<https://github.com/netty/netty/blob/4.1/common/src/main/java/io/netty/util/internal/PlatformDependent.java#L415>,
and this was because of the decision taken by Netty at startup w.r.t.
whether to provide access to the constructor or not.

We should set the io.netty.tryReflectionSetAccessible system property to true
in the Java root pom.
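For reference, a hedged sketch of what enabling this looks like. The usual route
would be passing -Dio.netty.tryReflectionSetAccessible=true to the forked JVMs
(e.g. via the surefire argLine in the root pom); the snippet below shows the
programmatic equivalent and is an illustration only:

// Sketch only: the property must be set before Netty's PlatformDependent0 is
// class-initialized, because Netty reads it in a static initializer to decide
// whether DirectByteBuffer(long, int) may be made accessible.
public class NettyReflectionAccess {
  public static void main(String[] args) {
    // Equivalent to launching the JVM with -Dio.netty.tryReflectionSetAccessible=true
    System.setProperty("io.netty.tryReflectionSetAccessible", "true");

    // ... create allocators / ArrowBufs afterwards; calls that end up in
    // PlatformDependent.directBuffer(address, size) should no longer throw
    // UnsupportedOperationException on JDK >= 9.
  }
}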

I want to make sure people are aware and agree/disagree with this change.

The tests now give the following warning:

WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by io.netty.util.internal.ReflectionUtil
(file:/Users/siddharthteotia/.m2/repository/io/netty/netty-common/4.1.22.Final/netty-common-4.1.22.Final.jar)
to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of
io.netty.util.internal.ReflectionUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal
reflective access operations
WARNING: All illegal access operations will be denied in a future release

Thanks.
On Thu, Apr 18, 2019 at 3:39 PM Siddharth Teotia 
wrote:

> I  have made all the necessary changes in java code to work with new
> ArrowBuf, ReferenceManager interfaces. More importantly, there is a wrapper
> buffer NettyArrowBuf interface to comply with usage in RPC and Netty
> related code. It will be good to get feedback on this one (and of course
> all other changes).  As of now, the java modules build fine but I have to
> fix test failures. That is in progress.
>
> On Wed, Apr 17, 2019 at 6:41 AM Jacques Nadeau  wrote:
>
>> Are there any other general comments here? If not, let's get this done and
>> merged.
>>
>> On Mon, Apr 15, 2019, 4:19 PM Siddharth Teotia 
>> wrote:
>>
>> > I believe reader/writer indexes are typically used when we send buffers
>> > over the wire -- so may not be necessary for all users of ArrowBuf.  I
>> am
>> > okay with the idea of providing a simple wrapper to ArrowBuf to manage
>> the
>> > reader/writer indexes with a couple of APIs. Note that some APIs like
>> > writeInt, writeLong() bump the writer index unlike setInt/setLong
>> > counterparts. JsonFileReader uses some of these APIs.
>> >
>> >
>> >
>> > On Sat, Apr 13, 2019 at 2:42 PM Jacques Nadeau 
>> wrote:
>> >
>> > > Hey Sidd,
>> > >
>> > > Thanks for pulling this together. This looks very promising. One quick
>> > > thought: do we think the concept of the reader and writer index need
>> to
>> > be
>> > > on ArrowBuf? It seems like something that could be added as an
>> additional
>> > > decoration/wrapper when needed instead of being part of the core
>> > structure.
>> > >
>> > > On Sat, Apr 13, 2019 at 11:26 AM Siddharth Teotia <
>> siddha...@dremio.com>
>> > > wrote:
>> > >
>> > > > Hi All,
>> > > >
>> > > > I have put a PR with WIP changes. All the major set of changes have
>> > been
>> > > > done to decouple the usage of ArrowBuf and reference management. The
>> > > > ArrowBuf interface is much simpler and clean now.
>> > > >
>> > > > I believe there would be several folks in the community interested
>> in
>> > > these
>> > > > changes so please feel free to take a look at the PR and provide
>> your
>> > > > feedback -- https://github.com/apache/arrow/pull/4151
>> > > >
>> > > > There is some cleanup needed (code doesn't compile yet) due to
>> moving
>> > the
>> > > > APIs but I have raised the PR to get an early feedback from the
>> > community
>> > > > on the critical changes.
>> > > >
>> > > > Thanks,
>> > > > Siddharth
>> > > >
>> > >
>> >
>>
>


Re: ARROW-3191: Making ArrowBuf work with arbitrary memory

2019-04-18 Thread Siddharth Teotia
I have made all the necessary changes in the Java code to work with the new
ArrowBuf and ReferenceManager interfaces. More importantly, there is a wrapper
buffer, NettyArrowBuf, to comply with usage in RPC and Netty
related code. It will be good to get feedback on this one (and of course
all the other changes).  As of now, the Java modules build fine, but I have to
fix test failures. That is in progress.

On Wed, Apr 17, 2019 at 6:41 AM Jacques Nadeau  wrote:

> Are there any other general comments here? If not, let's get this done and
> merged.
>
> On Mon, Apr 15, 2019, 4:19 PM Siddharth Teotia 
> wrote:
>
> > I believe reader/writer indexes are typically used when we send buffers
> > over the wire -- so may not be necessary for all users of ArrowBuf.  I am
> > okay with the idea of providing a simple wrapper to ArrowBuf to manage
> the
> > reader/writer indexes with a couple of APIs. Note that some APIs like
> > writeInt, writeLong() bump the writer index unlike setInt/setLong
> > counterparts. JsonFileReader uses some of these APIs.
> >
> >
> >
> > On Sat, Apr 13, 2019 at 2:42 PM Jacques Nadeau 
> wrote:
> >
> > > Hey Sidd,
> > >
> > > Thanks for pulling this together. This looks very promising. One quick
> > > thought: do we think the concept of the reader and writer index need to
> > be
> > > on ArrowBuf? It seems like something that could be added as an
> additional
> > > decoration/wrapper when needed instead of being part of the core
> > structure.
> > >
> > > On Sat, Apr 13, 2019 at 11:26 AM Siddharth Teotia <
> siddha...@dremio.com>
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > I have put a PR with WIP changes. All the major set of changes have
> > been
> > > > done to decouple the usage of ArrowBuf and reference management. The
> > > > ArrowBuf interface is much simpler and clean now.
> > > >
> > > > I believe there would be several folks in the community interested in
> > > these
> > > > changes so please feel free to take a look at the PR and provide your
> > > > feedback -- https://github.com/apache/arrow/pull/4151
> > > >
> > > > There is some cleanup needed (code doesn't compile yet) due to moving
> > the
> > > > APIs but I have raised the PR to get an early feedback from the
> > community
> > > > on the critical changes.
> > > >
> > > > Thanks,
> > > > Siddharth
> > > >
> > >
> >
>


Re: ARROW-3191: Making ArrowBuf work with arbitrary memory

2019-04-15 Thread Siddharth Teotia
I believe reader/writer indexes are typically used when we send buffers
over the wire -- so they may not be necessary for all users of ArrowBuf.  I am
okay with the idea of providing a simple wrapper over ArrowBuf to manage the
reader/writer indexes with a couple of APIs. Note that some APIs like
writeInt() and writeLong() bump the writer index, unlike their setInt()/setLong()
counterparts. JsonFileReader uses some of these APIs.
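As a concrete illustration of that distinction, a hedged toy sketch (not the
actual ArrowBuf API) of an absolute set-style API next to a thin wrapper that
tracks a writer index for write-style calls:

import java.nio.ByteBuffer;

// Toy sketch: absolute set-style access on a raw buffer, plus a thin wrapper
// that layers a writer index on top, roughly the decoration being discussed.
public class WriterIndexSketch {
  static class RawBuf {
    final ByteBuffer buf;
    RawBuf(int capacity) { this.buf = ByteBuffer.allocate(capacity); }

    // set-style: absolute index, no cursor is moved
    void setInt(int index, int value) { buf.putInt(index, value); }
    int getInt(int index) { return buf.getInt(index); }
  }

  // write-style wrapper: maintains a writerIndex and bumps it on each write,
  // which is what serialization paths (e.g. sending buffers over the wire) want
  static class WriteCursor {
    final RawBuf delegate;
    int writerIndex = 0;
    WriteCursor(RawBuf delegate) { this.delegate = delegate; }

    void writeInt(int value) {
      delegate.setInt(writerIndex, value);
      writerIndex += Integer.BYTES;
    }
  }

  public static void main(String[] args) {
    WriteCursor cursor = new WriteCursor(new RawBuf(64));
    cursor.writeInt(7);                             // lands at offset 0, writerIndex -> 4
    cursor.writeInt(11);                            // lands at offset 4, writerIndex -> 8
    System.out.println(cursor.delegate.getInt(4));  // 11
  }
}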



On Sat, Apr 13, 2019 at 2:42 PM Jacques Nadeau  wrote:

> Hey Sidd,
>
> Thanks for pulling this together. This looks very promising. One quick
> thought: do we think the concept of the reader and writer index need to be
> on ArrowBuf? It seems like something that could be added as an additional
> decoration/wrapper when needed instead of being part of the core structure.
>
> On Sat, Apr 13, 2019 at 11:26 AM Siddharth Teotia 
> wrote:
>
> > Hi All,
> >
> > I have put a PR with WIP changes. All the major set of changes have been
> > done to decouple the usage of ArrowBuf and reference management. The
> > ArrowBuf interface is much simpler and clean now.
> >
> > I believe there would be several folks in the community interested in
> these
> > changes so please feel free to take a look at the PR and provide your
> > feedback -- https://github.com/apache/arrow/pull/4151
> >
> > There is some cleanup needed (code doesn't compile yet) due to moving the
> > APIs but I have raised the PR to get an early feedback from the community
> > on the critical changes.
> >
> > Thanks,
> > Siddharth
> >
>


ARROW-3191: Making ArrowBuf work with arbitrary memory

2019-04-13 Thread Siddharth Teotia
Hi All,

I have put up a PR with WIP changes. All the major changes have been
done to decouple the usage of ArrowBuf from reference management. The
ArrowBuf interface is much simpler and cleaner now.

I believe there would be several folks in the community interested in these
changes so please feel free to take a look at the PR and provide your
feedback -- https://github.com/apache/arrow/pull/4151

There is some cleanup needed (the code doesn't compile yet) due to moving the
APIs, but I have raised the PR to get early feedback from the community
on the critical changes.

Thanks,
Siddharth


Re: [VOTE] Proposed change to Arrow Flight protocol: endpoint URIs

2019-04-10 Thread Siddharth Teotia
+1 (binding)

On Tue, Apr 9, 2019 at 9:53 PM Kouhei Sutou  wrote:

> +1 (binding)
>
> In 
>   "[VOTE] Proposed change to Arrow Flight protocol: endpoint URIs" on Mon,
> 8 Apr 2019 20:36:26 +0200,
>   Antoine Pitrou  wrote:
>
> >
> > Hello,
> >
> > David Li has proposed to make the following change to the Flight gRPC
> > service definition, as explained in this document:
> >
> https://docs.google.com/document/d/1Eps9eHvBc_qM8nRsTVwVCuWwHoEtQ-a-8Lv5dswuQoM/
> >
> > The proposed change is to replace (host, port) pairs to identify
> > endpoints with RFC 3986-compliant URIs.  This will help describe with
> > much more flexibility how a given Flight stream can be reached, for
> > example by allowing different transport protocols (gRPC over TLS or Unix
> > sockets can be reasonably implemented, but in the future we may also
> > want to implement transport protocols that are not gRPC-based, for
> > example a REST protocol directly over HTTP).
> >
> > An example URI is "grpc+tcp://192.168.0.1:3337".
> >
> > Please vote whether to accept the changes. The vote will be open for at
> > least 72 hours.
> >
> > [ ] +1 Accept this change to the Flight protocol
> > [ ] +0
> > [ ] -1 Do not accept the changes because...
> >
> > Best regards
> >
> > Antoine.
>


Re: Java allocate buffer code

2019-03-28 Thread Siddharth Teotia
Hitesh,

I suggest you file a JIRA for the potential issue you are seeing and, if
possible, raise a PR with a test case that you think is broken with the current
code. Happy to discuss on the JIRA or the PR.

Thanks,
Siddharth

On Thu, Mar 28, 2019 at 11:20 AM Hitesh  wrote:

> Hi Siddharth:
>
> Here, I see a problem at line #162, where it's taking "bufferSize" to find
> the extra allocated bytes. It should be "valueCount*typeWidth +
> valueCount/8".
>
> Here is an example for that. Let's take 1000 ints. Then,
> valueCount = 1000 ints
> typWidth = 4 bytes
> validitiyBufferSize = 125 bytes
> valueBufferSize = 4000 bytes
> combinedSize(valueBufferSize + validityBufferSize) = 4128 bytes (multiple
> of 8)
> combinedSizeWith2ThePowerSize = 8192 bytes, this will be "bufferSize" at
> line#152.
>
> With the above calculation, this code should release
> (combinedSizeWith2ThePowerSize - combinedSize) = 4064 bytes. But, this is
> not happening.
>
> let me know if this example helps. Do we have some other channel to talk?
>
> Thanks.
> Hitesh.
>
>
>
>
>
>
> On Thursday, March 28, 2019, 10:59:18 AM PDT, Siddharth Teotia <
> siddha...@dremio.com> wrote:
>
>
>
>
>
> Hitesh,
>
> Yes, if you see in the code, the sliced buffers have their reference
> counts bumped up before the compound buffer is released. Bumping up the
> reference counts of child/sliced buffers allows us to release the compound
> buffer safely. Does that make sense?
>
> Thanks,
> Siddharth
>
> On Wed, Mar 27, 2019 at 12:45 PM Hitesh 
> wrote:
> >  Hi Siddarth:
> > Thanks. Yes, I am referring to the compound buffer as an extra buffer. This we
> > release and further it can be reused?
> > Let's take an example of 1000 ints.
> > Then, it will need the following bytes.
> > getValidityBufferSize: 125, value bufferSize: 4000, combinedSize: 4128,
> > combinedSizeWith2ThePower: 8192
> > Then, that code should release (combinedSizeWith2ThePower - combinedSize =
> > 4064) bytes? I think that's the intention of that code but it's considering
> > another calculated value.
> > Please let me know what you think about it?
> > Thanks. Hitesh.
> > On Wednesday, March 27, 2019, 12:23:47 PM PDT, Siddharth Teotia <
> > siddha...@dremio.com> wrote:
> >
> >  Hi Hitesh,
> >
> > The code you referenced allocates data and validity buffers for a fixed
> > width vector. It first determines the appropriate buffer size for a given
> > value count and then allocates a compound buffer. The compound buffer is
> > then sliced to get data and validity buffers and finally compound buffer
> is
> > released. Were you referring to compound buffer as extra buffer?
> >
> > Also, actualCount can't be equal to valueCount * typeWidth since it
> > represents the number of values that can be stored in the vector.
> > valueCount * typeWidth will give you the buffer size to be allocated for
> a
> > certain value count and data type.
> >
> > Thanks,
> > Siddharth
> >
> > On Wed, Mar 27, 2019 at 11:58 AM Hitesh Khamesra
> >  wrote:
> >
> >> Hi All:
> >> I was looking following code to release extra allocated buffer. It seems
> >> it should be considering actualCount as "valueCount*typeWidth". Then it
> >> should calculate extra buffer and release it. Right now, it calculates
> >> based on actually allocated size and not justifying the intend. Any
> >> thought??
> >> ===line 162 at "
> >>
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseValueVector.java
> >> "
> >> protected DataAndValidityBuffers allocFixedDataAndValidityBufs(int
> >> valueCount, int typeWidth) {
> >>  long bufferSize = computeCombinedBufferSize(valueCount, typeWidth);
> >>  assert bufferSize < MAX_ALLOCATION_SIZE;
> >>
> >>  int validityBufferSize;
> >>  int dataBufferSize;
> >>  if (typeWidth == 0) {
> >>validityBufferSize = dataBufferSize = (int) (bufferSize / 2);
> >>  } else {
> >>// Due to roundup to power-of-2 allocation, the bufferSize could be
> >> greater than the
> >>// requested size. Utilize the allocated buffer fully.;
> >>int actualCount = (int) ((bufferSize * 8.0) / (8 * typeWidth + 1));
> >>do {
> >>  validityBufferSize = (int)
> >> roundUp8(getValidityBufferSizeFromCount(actualCount));
> >>  dataBufferSize = (int) roundUp8(actualCount * typeWidth);
> >>  if (validityBufferSize + dataBufferSize <= bufferSize) {
> >>break;
> >>  }
> >>  --actualCount;
> >>} while (true);
> >>  }
> >> 
> >> Thanks.Hitesh.
> >>
> >
>


Re: Java allocate buffer code

2019-03-28 Thread Siddharth Teotia
Hitesh,

Yes, if you see in the code, the sliced buffers have their reference counts
bumped up before the compound buffer is released. Bumping up the reference
counts of child/sliced buffers allows us to release the compound buffer
safely. Does that make sense?

Thanks,
Siddharth

On Wed, Mar 27, 2019 at 12:45 PM Hitesh  wrote:

>  Hi Siddarth:
> Thanks. Yes, I am referring to the compound buffer as an extra buffer. This we
> release and further it can be reused?
> Let's take an example of 1000 ints.
> Then, it will need the following bytes.
> getValidityBufferSize: 125, value bufferSize: 4000, combinedSize: 4128,
> combinedSizeWith2ThePower: 8192
> Then, that code should release (combinedSizeWith2ThePower - combinedSize =
> 4064) bytes? I think that's the intention of that code but it's considering
> another calculated value.
> Please let me know what you think about it?
> Thanks. Hitesh.
> On Wednesday, March 27, 2019, 12:23:47 PM PDT, Siddharth Teotia <
> siddha...@dremio.com> wrote:
>
>  Hi Hitesh,
>
> The code you referenced allocates data and validity buffers for a fixed
> width vector. It first determines the appropriate buffer size for a given
> value count and then allocates a compound buffer. The compound buffer is
> then sliced to get data and validity buffers and finally compound buffer is
> released. Were you referring to compound buffer as extra buffer?
>
> Also, actualCount can't be equal to valueCount * typeWidth since it
> represents the number of values that can be stored in the vector.
> valueCount * typeWidth will give you the buffer size to be allocated for a
> certain value count and data type.
>
> Thanks,
> Siddharth
>
> On Wed, Mar 27, 2019 at 11:58 AM Hitesh Khamesra
>  wrote:
>
> > Hi All:
> > I was looking following code to release extra allocated buffer. It seems
> > it should be considering actualCount as "valueCount*typeWidth". Then it
> > should calculate extra buffer and release it. Right now, it calculates
> > based on actually allocated size and not justifying the intend. Any
> > thought??
> > ===line 162 at "
> >
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseValueVector.java
> > "
> > protected DataAndValidityBuffers allocFixedDataAndValidityBufs(int
> > valueCount, int typeWidth) {
> >  long bufferSize = computeCombinedBufferSize(valueCount, typeWidth);
> >  assert bufferSize < MAX_ALLOCATION_SIZE;
> >
> >  int validityBufferSize;
> >  int dataBufferSize;
> >  if (typeWidth == 0) {
> >validityBufferSize = dataBufferSize = (int) (bufferSize / 2);
> >  } else {
> >// Due to roundup to power-of-2 allocation, the bufferSize could be
> > greater than the
> >// requested size. Utilize the allocated buffer fully.;
> >int actualCount = (int) ((bufferSize * 8.0) / (8 * typeWidth + 1));
> >do {
> >  validityBufferSize = (int)
> > roundUp8(getValidityBufferSizeFromCount(actualCount));
> >  dataBufferSize = (int) roundUp8(actualCount * typeWidth);
> >  if (validityBufferSize + dataBufferSize <= bufferSize) {
> >break;
> >  }
> >  --actualCount;
> >} while (true);
> >  }
> > 
> > Thanks.Hitesh.
> >
>


Re: Java allocate buffer code

2019-03-27 Thread Siddharth Teotia
Hi Hitesh,

The code you referenced allocates the data and validity buffers for a fixed
width vector. It first determines the appropriate buffer size for a given
value count and then allocates a compound buffer. The compound buffer is
then sliced to get the data and validity buffers, and finally the compound
buffer is released. Were you referring to the compound buffer as the extra buffer?

Also, actualCount can't be equal to valueCount * typeWidth, since actualCount
represents the number of values that can be stored in the vector, while
valueCount * typeWidth gives you the buffer size to be allocated for a
certain value count and data type.

Thanks,
Siddharth
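To make the numbers in this thread concrete, a small hedged sketch that
reproduces the round-up arithmetic of the allocFixedDataAndValidityBufs code
quoted below for the 1000-int example; the helper names are local to the
sketch, not Arrow's:

// Worked example of the sizing loop quoted below, for valueCount = 1000, typeWidth = 4.
// The 8192-byte compound buffer ends up fully carved into validity + data for as
// many values as fit (actualCount), rather than just the requested 1000.
public class BufferSizingSketch {
  static long roundUp8(long size) { return (size + 7) & ~7L; }
  static long validityBytes(long count) { return (count + 7) / 8; }

  public static void main(String[] args) {
    int valueCount = 1000, typeWidth = 4;
    long combined = roundUp8(validityBytes(valueCount)) + roundUp8((long) valueCount * typeWidth);
    long bufferSize = Long.highestOneBit(combined - 1) << 1;  // next power of 2: 8192

    long actualCount = (long) ((bufferSize * 8.0) / (8 * typeWidth + 1));
    long validitySize, dataSize;
    do {
      validitySize = roundUp8(validityBytes(actualCount));
      dataSize = roundUp8(actualCount * typeWidth);
      if (validitySize + dataSize <= bufferSize) {
        break;
      }
      --actualCount;
    } while (true);

    // Prints roughly: bufferSize=8192 actualCount=1984 validity=248 data=7936
    System.out.println("bufferSize=" + bufferSize + " actualCount=" + actualCount
        + " validity=" + validitySize + " data=" + dataSize);
  }
}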

On Wed, Mar 27, 2019 at 11:58 AM Hitesh Khamesra
 wrote:

> Hi All:
> I was looking at the following code to release the extra allocated buffer. It
> seems it should be considering actualCount as "valueCount*typeWidth". Then it
> should calculate the extra buffer and release it. Right now, it calculates
> based on the actually allocated size and does not justify the intent. Any
> thoughts?
> ===line 162 at "
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseValueVector.java
> "
> protected DataAndValidityBuffers allocFixedDataAndValidityBufs(int
> valueCount, int typeWidth) {
>   long bufferSize = computeCombinedBufferSize(valueCount, typeWidth);
>   assert bufferSize < MAX_ALLOCATION_SIZE;
>
>   int validityBufferSize;
>   int dataBufferSize;
>   if (typeWidth == 0) {
> validityBufferSize = dataBufferSize = (int) (bufferSize / 2);
>   } else {
> // Due to roundup to power-of-2 allocation, the bufferSize could be
> greater than the
> // requested size. Utilize the allocated buffer fully.;
> int actualCount = (int) ((bufferSize * 8.0) / (8 * typeWidth + 1));
> do {
>   validityBufferSize = (int)
> roundUp8(getValidityBufferSizeFromCount(actualCount));
>   dataBufferSize = (int) roundUp8(actualCount * typeWidth);
>   if (validityBufferSize + dataBufferSize <= bufferSize) {
> break;
>   }
>   --actualCount;
> } while (true);
>   }
> 
> Thanks.Hitesh.
>


Re: Arrow development sync call today 12pm Eastern / 17:00 UTC

2018-11-14 Thread Siddharth Teotia
Notes:

Attendees:
Sidd
Wes
Ravindra
Arvind
Shyam
Bryan
Francois

Bryan:
1. Switching over to Java time from Joda time. At Dremio we need to assess
the impact of these changes. Bryan will put up a WIP PR soon. There has been a
discussion about this on the mailing list.
2. The Gandiva microbenchmark test fails if the elapsed time exceeds a
certain threshold, and this results in some spurious failures in Travis CI.
Ravindra will disable the threshold checks.

Ravindra:
1. Working on Decimal support in Gandiva. Will raise PR later this week.

Wes:
1. We should push for the 0.12 release after Thanksgiving.
2. Volunteers are needed for JIRA cleanup.
3. We need to think about Gandiva packaging for the release.



On Wed, Nov 14, 2018 at 7:20 AM Wes McKinney  wrote:

> All are welcome
>
> https://meet.google.com/vtm-teks-phx
>


[jira] [Created] (ARROW-3194) Fix setValueCount in splitAndTransfer for variable width vectors

2018-09-07 Thread Siddharth Teotia (JIRA)
Siddharth Teotia created ARROW-3194:
---

 Summary: Fix setValueCount in splitAndTransfer for variable width 
vectors
 Key: ARROW-3194
 URL: https://issues.apache.org/jira/browse/ARROW-3194
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia


We need to use the split length as the value count of the target vector. We are 
incorrectly using the value count of the current vector for the target vector



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Arrow Sync

2018-08-22 Thread Siddharth Teotia
I have a clash this morning so won't be able to join the call.


Re: [VOTE] Accept donation of Gandiva to Apache Arrow

2018-08-16 Thread Siddharth Teotia
+1

On Thu, Aug 16, 2018 at 9:57 AM, Julian Hyde  wrote:

> +1
> On Thu, Aug 16, 2018 at 8:56 AM Wes McKinney  wrote:
> >
> > Dear all,
> >
> > The developers of Gandiva, an LLVM-based vectorized expression
> > evaluation engine for Arrow columnar memory, are proposing to donate
> > the project to Apache Arrow at some point in the near future, as has
> > been discussed on the dev@ mailing list [1].
> >
> > The Gandiva codebase is located at:
> >
> > https://github.com/dremio/gandiva
> >
> > This work is not yet in a patch-ready state, but I wish to determine
> > if the Arrow PMC is in favor of accepting this donation, subject to
> > the fulfillment of the ASF IP Clearance process.
> >
> > [ ] +1 : Accept contribution of Gandiva
> > [ ]  0 : No opinion
> > [ ] -1 : Reject contribution because...
> >
> > Here is my vote: +1
> >
> > The vote will be open for at least 72 hours.
> >
> > Thanks,
> > Wes
> >
> > [1]: https://lists.apache.org/thread.html/cded0b511c68da21246cd25e99b4ad77092d17219629f73e0dc85cad@%3Cdev.arrow.apache.org%3E
>


Re: [JAVA] SIMD vectorized fill of ArrowBuf from Java primitive type array?

2018-07-23 Thread Siddharth Teotia
Also look here
<https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/copier/FieldBufferCopier.java#L37>
to see how validity and data are copied independently between two vectors
bypassing all Arrow APIs and directly manipulating memory. The link points
to copying of data buffer. Further down in the file, you will see BitCopier
to copy validity bits.

On Mon, Jul 23, 2018 at 5:19 PM, Siddharth Teotia 
wrote:

> Eric, you can take a look here
> <https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Pivots.java#L176>
> how we try to optimize the copy (validity and data) in/out of vectors. We
> try to start with word-wise copy (64 column values and thus 64 validity
> bits) and then accordingly branch. Similar to this there are other examples
> of manipulating off heap buffers through PlatformDependent APIs -- which I
> think is same as using sun.misc.UNSAFE as the former eventually uses the
> latter underneath.
>
> In my opinion, we should take a look at  vector APIs and see where can we
> possibly eliminate branches. I did some of it earlier -- as an example see
> this
> <https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/IntVector.java#L144>
> for branch free cell-to-cell copy between two columns. The idea is to copy
> junk data disregarding the validity bit. As long as the validity bit is
> copied correctly, we are good.
>
> Couple of other things have been there on my todo list but haven't yet
> gotten to them. Like for your example, we should remove the branch here
> <https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/IntVector.java#L277>
> The caller is already telling us that validity bit is set or not (1 or 0 in
> parameter isSet). So just following is enough to set the value (Null or non
> null) at any cell and I think this will speed up the tight for loop in your
> application. There is no need to check whether isSet is 1 or 0. We should
> be simply setting it.
>
> PS:
>
> I have long been a proponent of implementing native SIMD acceleration
> support into Arrow libraries and have plugged this shamelessly in emails
> every now and then (like this one). Use cases like yours and several others
> can then be natively supported by Arrow for most optimal execution.
>
> On several occasions we have seen topic of SIMD acceleration support with
> Arrow coming up on mailing list and I think it's high time we should do
> something about it and build kernels  which can do simple tight loop
> operations like sum (vector1, vector2, num_values) and several others
> extremely efficiently.
>
>
> On Mon, Jul 23, 2018 at 4:44 PM, Wes McKinney  wrote:
>
>> hi Eric,
>>
>> Antoine recently did some work on faster bitsetting in C++ by
>> unrolling the main loop to set one byte at a time
>>
>> https://github.com/apache/arrow/blob/27b869ae5df31f3be61e76e
>> 9d96ea7d9b557/cpp/src/arrow/util/bit-util.h#L598
>>
>> This yielded major speedups when setting a lot of bits. A similar
>> strategy should be possible in Java for your use case. We speculated
>> that it could be made even faster by eliminating the branch in the
>> bit-setting assignments (the g() | left_branch : right_branch
>> statements). If you dig around in the Dremio codebase you can find
>> plenty of low level off-heap memory manipulation that may be helpful
>> (others may be able to comment).
>>
>> If some utilities could be developed here in the Arrow Java codebase
>> for common benefit, that would be great.
>>
>> Otherwise copying the values data without branching is an obvious
>> optimization. Others may have ideas
>>
>> - Wes
>>
>> On Mon, Jul 23, 2018 at 5:50 PM, Eric Wohlstadter 
>> wrote:
>> > Hi all,
>> >   I work on a project that uses Arrow streaming format to transfer data
>> > between Java processes.
>> > We're also following the progress on Java support for Plasma, and may
>> > decide use Plasma also.
>> >
>> > We typically uses a pattern like this to fill Arrow vectors from Java
>> > arrays:
>> > 
>> > int[] inputValues = ...;
>> > boolean[] nullInputValues = ...;
>> >
>> > org.apache.arrow.vector.IntVector vector = ...;
>> > for(int i = 0; i < inputValues.size; i++) {
>> >   if(nullInputValues[i]) {
>> > vector.setNull(i);
>> >   } else {
>> > vector.set(i, inputValues[i]);
>> >   }
>> > }
>> > 
>> >
>> > Obviously the JIT won't be able to v

Re: [JAVA] SIMD vectorized fill of ArrowBuf from Java primitive type array?

2018-07-23 Thread Siddharth Teotia
Eric, you can take a look here

how we try to optimize the copy (validity and data) in/out of vectors. We
try to start with a word-wise copy (64 column values and thus 64 validity
bits) and then branch accordingly. Similar to this, there are other examples
of manipulating off-heap buffers through the PlatformDependent APIs -- which I
think is the same as using sun.misc.UNSAFE, as the former eventually uses the
latter underneath.

In my opinion, we should take a look at the vector APIs and see where we can
possibly eliminate branches. I did some of this earlier -- as an example, see
this

for a branch-free cell-to-cell copy between two columns. The idea is to copy
junk data disregarding the validity bit. As long as the validity bit is
copied correctly, we are good.

A couple of other things have been on my todo list but I haven't yet
gotten to them. For your example, we should remove the branch here

The caller is already telling us whether the validity bit is set or not (1 or 0
in the isSet parameter). So the following is enough to set the value (null or
non-null) at any cell, and I think this will speed up the tight for loop in your
application. There is no need to check whether isSet is 1 or 0. We should
simply be setting it.
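A hedged toy sketch of what such a branch-free set(index, isSet, value) could
look like, with plain arrays standing in for the off-heap buffers (this is not
the actual IntVector code):

// Toy branch-free set(index, isSet, value): the value is written unconditionally
// ("junk" data when null), and the validity bit is set or cleared from isSet
// without any if/else. Arrays stand in for Arrow's off-heap buffers here.
public class BranchFreeSetSketch {
  private final int[] data;
  private final byte[] validity;

  public BranchFreeSetSketch(int valueCount) {
    this.data = new int[valueCount];
    this.validity = new byte[(valueCount + 7) / 8];
  }

  // isSet must be 0 (null) or 1 (non-null), as in the holder-based APIs.
  public void set(int index, int isSet, int value) {
    int byteIndex = index >> 3;
    int bitMask = 1 << (index & 7);
    // clear the bit, then OR it back in only when isSet == 1
    // (-1 & mask == mask when isSet == 1, 0 & mask == 0 when isSet == 0)
    validity[byteIndex] = (byte) ((validity[byteIndex] & ~bitMask) | (-isSet & bitMask));
    data[index] = value;  // written regardless of isSet; only the validity bit matters
  }

  public boolean isNull(int index) {
    return (validity[index >> 3] & (1 << (index & 7))) == 0;
  }
}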

PS:

I have long been a proponent of implementing native SIMD acceleration
support in the Arrow libraries and have plugged this shamelessly in emails
every now and then (like this one). Use cases like yours and several others
could then be natively supported by Arrow for optimal execution.

On several occasions we have seen the topic of SIMD acceleration support in
Arrow coming up on the mailing list, and I think it's high time we did
something about it and built kernels which can do simple tight-loop
operations like sum(vector1, vector2, num_values) and several others
extremely efficiently.


On Mon, Jul 23, 2018 at 4:44 PM, Wes McKinney  wrote:

> hi Eric,
>
> Antoine recently did some work on faster bitsetting in C++ by
> unrolling the main loop to set one byte at a time
>
> https://github.com/apache/arrow/blob/27b869ae5df31f3be61e76e9d9
> 6ea7d9b557/cpp/src/arrow/util/bit-util.h#L598
>
> This yielded major speedups when setting a lot of bits. A similar
> strategy should be possible in Java for your use case. We speculated
> that it could be made even faster by eliminating the branch in the
> bit-setting assignments (the g() | left_branch : right_branch
> statements). If you dig around in the Dremio codebase you can find
> plenty of low level off-heap memory manipulation that may be helpful
> (others may be able to comment).
>
> If some utilities could be developed here in the Arrow Java codebase
> for common benefit, that would be great.
>
> Otherwise copying the values data without branching is an obvious
> optimization. Others may have ideas
>
> - Wes
>
> On Mon, Jul 23, 2018 at 5:50 PM, Eric Wohlstadter 
> wrote:
> > Hi all,
> >   I work on a project that uses Arrow streaming format to transfer data
> > between Java processes.
> > We're also following the progress on Java support for Plasma, and may
> > decide use Plasma also.
> >
> > We typically uses a pattern like this to fill Arrow vectors from Java
> > arrays:
> > 
> > int[] inputValues = ...;
> > boolean[] nullInputValues = ...;
> >
> > org.apache.arrow.vector.IntVector vector = ...;
> > for(int i = 0; i < inputValues.size; i++) {
> >   if(nullInputValues[i]) {
> > vector.setNull(i);
> >   } else {
> > vector.set(i, inputValues[i]);
> >   }
> > }
> > 
> >
> > Obviously the JIT won't be able to vectorize this loop. Does anyone know
> if
> > there is another way to achieve this which
> > would be vectorized?
> >
> > Here is a pseudo-code mockup of what I was thinking about, is this
> approach
> > worth pursuing?
> >
> > The idea is to try to convert input into Arrow format in a vectorized
> loop,
> > and then use sun.misc.Unsafe to copy the
> > converted on-heap input to an off-heap valueBuffer.
> >
> > I'll ignore the details of the validityBuffer here, since it would follow
> > along the same lines:
> >
> > 
> > int[] inputValues = ...;
> > org.apache.arrow.vector.IntVector vector = ...;
> >
> > for(int i = 0; i < inputValues.size; i++) {
> >   //convert inputValues[i] to little-endian
> >   //this conversion can be SIMD vectorized?
> > }
> > UNSAFE.copyMemory(
> >   inputValues,
> >   0,
> >   null,
> >   vector.getDataBuffer().memoryAddress(),
> >   sizeof(Integer.class) * inputValues.size
> > );
> > 
> >
> > Thanks for any feedback about details I may be misunderstanding, which
> > would make this approach infeasible.
>


Re: [DISCUSS] Developing a standard memory layout for in-memory records / "row-oriented" data

2018-06-27 Thread Siddharth Teotia
I am wondering if this can be considered an opportunity to implement
support in Arrow for building high-performance in-memory row stores for
low-latency, high-throughput key-based queries. In other words, we could
design the in-memory record format keeping efficient RDMA reads as one of
the goals too. Consider two data structures in memory -- a hash table and
a row store comprising records in an Arrow row format. The hash table points
to the row store, and information can be read from both data structures
without interrupting the CPU on the server. This client-server code path could
also be incorporated into Arrow Flight.
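As a rough illustration of the fixed-block-plus-variable-sidecar layout
discussed further down in this thread, a hedged toy encoder; the field names
and widths are made up for the example:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Toy row layout: a fixed-width block holding (a: int32, b: float64, plus
// offset/length of a variable-width string), and a separate variable-width
// sidecar holding the string bytes. Purely illustrative, not a proposed spec.
public class RowLayoutSketch {
  static final int FIXED_RECORD_WIDTH = 4 + 8 + 4 + 4;  // a, b, c_offset, c_length

  public static void main(String[] args) {
    ByteBuffer fixed = ByteBuffer.allocate(FIXED_RECORD_WIDTH);
    ByteBuffer varSidecar = ByteBuffer.allocate(64);

    // encode one record {a = 42, b = 3.5, c = "hello"}
    byte[] c = "hello".getBytes(StandardCharsets.UTF_8);
    int cOffset = varSidecar.position();
    varSidecar.put(c);
    fixed.putInt(42).putDouble(3.5).putInt(cOffset).putInt(c.length);

    // random access back out of the fixed block
    System.out.println(fixed.getInt(0));      // 42
    System.out.println(fixed.getDouble(4));   // 3.5
    int off = fixed.getInt(12), len = fixed.getInt(16);
    System.out.println(new String(varSidecar.array(), off, len, StandardCharsets.UTF_8)); // hello
  }
}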

On Tue, Jun 26, 2018 at 7:49 PM, Jacques Nadeau  wrote:

> I'm not sure this makes sense as an external stable api. I definitely think
> it is useful as an internal representation for use within a particular
> algorithm. I also think that can be informed by the particular algorithm
> that you're working on.
>
> We definitely had this requirement in Dremio and came up with an internal
> representation that we are happy with for the use in hash tables. I'll try
> to dig up the design docs we had around this but the actual
> pivoting/unpivoting code that we developed can be seen here: [1], [2].
>
> Our main model is two blocks: a fixed width block and a variable width
> block (with the fixed width block also carrying address & length of the
> variable data). Fixed width is randomly accessible and variable width is
> randomly accessible through fixed width.
>
> [1]
> https://github.com/dremio/dremio-oss/blob/master/sabot/
> kernel/src/main/java/com/dremio/sabot/op/common/ht2/Pivots.java
> [2]
> https://github.com/dremio/dremio-oss/blob/master/sabot/
> kernel/src/main/java/com/dremio/sabot/op/common/ht2/Unpivots.java
>
> On Tue, Jun 26, 2018 at 10:20 AM, Wes McKinney 
> wrote:
>
> > hi Antoine,
> >
> > On Sun, Jun 24, 2018 at 1:06 PM, Antoine Pitrou 
> > wrote:
> > >
> > > Hi Wes,
> > >
> > > Le 24/06/2018 à 08:24, Wes McKinney a écrit :
> > >>
> > >> If this sounds interesting to the community, I could help to kickstart
> > >> a design process which would likely take a significant amount of time.
> > >> The requirements could be complex (i.e. we might want to support
> > >> variable-size record fields while also providing random access
> > >> guarantees).
> > >
> > > What do you call "variable-sized" here? A scheme where the length of a
> > > record's field is determined by the value of another field in the same
> > > record?
> >
> > As an example, here is a fixed size record
> >
> > record foo {
> >   a: int32;
> >   b: float64;
> >   c: uint8;
> > }
> >
> > With padding suppose this is 16 bytes per record; so if we have a
> > column of these, then random accessing any value in any record is
> > simple.
> >
> > Here's a variable-length record:
> >
> > record bar {
> >   a: string;
> >   b: list;
> > }
> >
> > What I've seen done to represent this in memory is to have a fixed
> > size record followed by a sidecar containing the variable-length data,
> > so the fixed size portion might look something like
> >
> > a_offset: int32;
> > a_length: int32;
> > b_offset: int32;
> > b_length: int32;
> >
> > So from this, you can do random access into the record. If you wanted
> > to do random access on a _column_ of such records, it is similar to
> > our current variable-length Binary type. So it might be that the
> > underlying Arrow memory layout would be FixedSizeBinary for fixed-size
> > records and variable Binary for variable-size records.
> >
> > - Wes
> >
> > >
> > >
> > >
> > > Regards
> > >
> > > Antoine.
> >
>


Re: Arrow sync at 12:00 US/Eastern today

2018-06-13 Thread Siddharth Teotia
I have a conflict so won't be able to join.

On Wed, Jun 13, 2018, 5:46 AM Wes McKinney  wrote:

> As usual we will be meeting at https://meet.google.com/vtm-teks-phx
>


Re: JDBC Adapter PR - 1759

2018-05-29 Thread Siddharth Teotia
Hi Atul,

I will take a look today.

Thanks,
Sidd

On Tue, May 29, 2018 at 2:45 AM, Atul Dambalkar 
wrote:

> Hi Sid, Laurent, Uwe,
>
> Any idea when can someone take a look at the PR https://github.com/apache/
> arrow/pull/1759/.
>
> Laurent had given bunch of comments earlier and now we have taken care of
> most of those. We have also added multiple test cases. It will be great if
> someone can take a look.
>
> Regards,
> -Atul
>
>


Re: Is there list writer in Java?

2018-04-17 Thread Siddharth Teotia
Hi Teddy,

Yes, UnionListWriter currently doesn't support writing decimals into a list
vector. Basically, we are missing APIs like UnionListWriter.decimal(), which
would return a DecimalWriter (we already have that class), and the latter
could then be used to write decimals into the list. I'd suggest you go ahead
and file a JIRA at https://issues.apache.org/.
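
By analogy with the existing element writers on the list writer (e.g.
bigInt(), varChar()), the missing usage would look roughly like the sketch
below. Note this is a hypothetical shape, not code that works today; the
writer setup is illustrative:

// Hypothetical sketch: decimal() does not exist on UnionListWriter yet,
// which is exactly the gap described above.
UnionListWriter writer = listVector.getWriter();
writer.setPosition(0);
writer.startList();
writer.decimal().writeDecimal(...);   // the missing accessor/writer path
writer.endList();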

Please feel free to take it up if you are interested in contributing to
Arrow codebase.

Thanks,
Siddharth

On Tue, Apr 17, 2018 at 8:41 PM, Teddy Choi  wrote:

> Hello, all
>
> I’m new to Apache Arrow, and checking whether it supports most of data
> types. It seems like that there’s no implementation for list
> writer in UnionListWriter yet. Is there other way?
>
> Thanks.
>
> Teddy Choi.


Re: Correct way to set NULL values in VarCharVector (Java API)?

2018-04-11 Thread Siddharth Teotia
Another option is to use the set() API that allows you to indicate whether
the value is NULL or not using an isSet parameter (0 for NULL, 1
otherwise). This is similar to the holder-based APIs, where you indicate in
holder.isSet whether the value is NULL or not.

https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/BaseVariableWidthVector.java#L1095
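
For example, a minimal sketch of handling NULLs when populating a
VarCharVector (assuming the current API where these methods live directly
on the vector; the field name and allocator setup are just illustrative):

// Needs: org.apache.arrow.memory.RootAllocator,
//        org.apache.arrow.vector.VarCharVector, java.nio.charset.StandardCharsets
try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
     VarCharVector vector = new VarCharVector("names", allocator)) {
  vector.allocateNew();
  String[] input = {"foo", null, "bar"};
  for (int i = 0; i < input.length; i++) {
    if (input[i] == null) {
      vector.setNull(i);   // or the set(...) overload above with isSet = 0
    } else {
      vector.setSafe(i, input[i].getBytes(StandardCharsets.UTF_8));
    }
  }
  vector.setValueCount(input.length);
}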

Thanks,
Siddharth

On Wed, Apr 11, 2018 at 6:14 AM, Emilio Lahr-Vivaz 
wrote:

> Hi Atul,
>
> You should be able to use the overloaded 'set' method that takes a
> NullableVarCharHolder:
>
> https://github.com/apache/arrow/blob/master/java/vector/src/
> main/java/org/apache/arrow/vector/VarCharVector.java#L237
>
> Thanks,
>
> Emilio
>
>
> On 04/10/2018 05:23 PM, Atul Dambalkar wrote:
>
>> Hi,
>>
>> I wanted to know what's the best way to handle NULL string values coming
>> from a relational database. I am trying to set the string values in Java
>> API - VarCharVector. Like few other Arrow Vectors (TimeStampVector,
>> TimeMilliVector), the VarCharVector doesn't have a way to set a NULL value
>> as one of the elements. Can someone advise what's the correct mechanism to
>> store NULL values in this case.
>>
>> Regards,
>> -Atul
>>
>>
>>
>


Re: What do people think about a one day get together?

2018-04-04 Thread Siddharth Teotia
+1. I would love to attend.

On Tue, Apr 3, 2018 at 4:18 PM, Kevin Moore  wrote:

> Sounds great. Quilt Data may be able to sponsor some of the refreshment
> costs.
>
> 
> Kevin Moore
> CEO, Quilt Data, Inc.
> ke...@quiltdata.io | LinkedIn 
> (415) 497-7895
>
>
> Manage Data like Code
> quiltdata.com
>
> On Tue, Apr 3, 2018 at 1:41 PM, Li Jin  wrote:
>
> > I'd love to attend. I will be around for Spark Summit.
> >
> > Li
> >
> >
> > On Tue, Apr 3, 2018 at 11:48 AM, Jacques Nadeau 
> > wrote:
> >
> > > Hey All,
> > >
> > > In light of growing interest in Apache Arrow over the past year and the
> > > great response to the meetup talk invitation I sent last week, I was
> > > thinking it may be time to hold a single day conference focused on the
> > > project. Wes and I have previously thrown this idea around and it seems
> > > like it might be a good time to get something started. Some of my
> > > colleagues did an investigation on how and when we could do this. I'm
> > > raising this to you all now to get people's thoughts.
> > >
> > >
> > > A rough sketch of what Wes and I have bounced around:
> > >
> > > *One day developer-focused event on Apache Arrow in San Francisco, June
> > 7,
> > > just after Spark Summit (open to other dates, but it would be nice for
> > > folks attending the conference to stay one extra day for Arrow).
> > >
> > > * Focus on interesting use cases and applications of Arrow. We could
> also
> > > use this event to discuss/plan/present about movement to Arrow 1.0 this
> > > year and beyond.
> > >
> > > *Goal of 100-200 attendees.
> > >
> > > *Dremio can offer to organize the event (venue, logistics,
> registrations,
> > > etc). The goal would be to keep ticket costs very modest to encourage
> > > attendance (eg, $50). Opportunity for sponsorship by vendors to help
> > drive
> > > down costs (eg, refreshments).
> > >
> > > *Still need to determine a venue but probably something downtown SF
> > nearish
> > > Moscone.
> > >
> > > *PMC or appointed sub-committee could review talk submissions. We could
> > use
> > > something like EasyChair to make this as simple as possible.
> > >
> > > What do people think? I think this could be good to continue to drive
> and
> > > grow the community in a positive way.
> > >
> > > thanks,
> > > Jacques
> > >
> >
>


Re: Arrow sync tomorrow: 12:00 US/Eastern, please review packaging thread

2018-04-04 Thread Siddharth Teotia
Got it: https://meet.google.com/vtm-teks-phx

On Wed, Apr 4, 2018 at 8:48 AM, Siddharth Teotia <siddha...@dremio.com>
wrote:

> Can someone please send me the link to gcal? For some reason it has
> vanished from my calendar.
>
> On Wed, Apr 4, 2018 at 7:49 AM, Li Jin <ice.xell...@gmail.com> wrote:
>
>> Sorry I have a conflict today so won't be able to join.
>>
>> Li
>>
>> On Wed, Apr 4, 2018 at 1:53 AM, Bhaskar Mookerji <mooke...@gmail.com>
>> wrote:
>>
>> > Can someone attending this send out notes afterwards? It would be very
>> much
>> > appreciated.
>> >
>> > Thanks,
>> > Buro
>> >
>> > On Tue, Apr 3, 2018 at 2:44 PM, Wes McKinney <wesmck...@gmail.com>
>> wrote:
>> >
>> > > hi folks,
>> > >
>> > > We have a sync call tomorrow. Could everyone please review the
>> > > packaging mailing list thread and if possible review and comment on
>> > > Phillip's document about this? We need to begin taking action to fix
>> > > these problems:
>> > >
>> > > https://docs.google.com/document/d/1IyhbQpiElxTsI8HbMZ-
>> > > g9EGPOtcFdtMBzEyDJv48BKc/edit?usp=sharing
>> > >
>> > > Thanks
>> > > Wes
>> > >
>> >
>>
>
>


Re: Arrow sync tomorrow: 12:00 US/Eastern, please review packaging thread

2018-04-04 Thread Siddharth Teotia
Can someone please send me the link to gcal? For some reason it has
vanished from my calendar.

On Wed, Apr 4, 2018 at 7:49 AM, Li Jin  wrote:

> Sorry I have a conflict today so won't be able to join.
>
> Li
>
> On Wed, Apr 4, 2018 at 1:53 AM, Bhaskar Mookerji 
> wrote:
>
> > Can someone attending this send out notes afterwards? It would be very
> much
> > appreciated.
> >
> > Thanks,
> > Buro
> >
> > On Tue, Apr 3, 2018 at 2:44 PM, Wes McKinney 
> wrote:
> >
> > > hi folks,
> > >
> > > We have a sync call tomorrow. Could everyone please review the
> > > packaging mailing list thread and if possible review and comment on
> > > Phillip's document about this? We need to begin taking action to fix
> > > these problems:
> > >
> > > https://docs.google.com/document/d/1IyhbQpiElxTsI8HbMZ-
> > > g9EGPOtcFdtMBzEyDJv48BKc/edit?usp=sharing
> > >
> > > Thanks
> > > Wes
> > >
> >
>


Re: Trouble Updating Java artifacts

2018-03-22 Thread Siddharth Teotia
Thanks, Uwe and Wes.

On Thu, Mar 22, 2018, 5:37 PM Wes McKinney <wesmck...@gmail.com> wrote:

> Thanks Uwe. I opened https://github.com/apache/arrow/pull/1782 about
> documenting this properly in the RM guide
>
> On Thu, Mar 22, 2018 at 3:08 PM, Uwe L. Korn <uw...@xhochy.com> wrote:
> > Hello,
> >
> > you need to first setup up Maven to know your Apache credentials:
> http://www.apache.org/dev/publishing-maven-artifacts.html#dev-env
> >
> > I have taken care of the upload, please verify that the artifacts are
> all up.
> >
> > Uwe
> >
> > On Wed, Mar 21, 2018, at 5:22 PM, Siddharth Teotia wrote:
> >> Hi All,
> >>
> >> I think the steps mentioned in RM doc for updating java artifacts are
> >> incomplete. I am getting the following error:
> >>
> >> Failed to deploy artifacts: Could not transfer artifact
> >> org.apache.arrow:arrow-java-root:pom:0.9.0 from/to apache.releases.https
> >> (
> >> https://repository.apache.org/service/local/staging/deploy/maven2):
> >> Failed
> >> to transfer file:
> >>
> https://repository.apache.org/service/local/staging/deploy/maven2/org/apache/arrow/arrow-java-root/0.9.0/arrow-java-root-0.9.0.pom
> .
> >> Return code is: 401, ReasonPhrase: Unauthorized
> >>
> >> Wes had filed https://issues.apache.org/jira/browse/ARROW-2322 to
> track the
> >> problem. I understand why I see the problem since I am not a PMC. But it
> >> looks like even PMCs are facing the same issue.
> >>
> >> Does anyone know what needs to be done?
> >>
> >> RM doc:
> >>
> https://github.com/apache/arrow/blob/master/dev/release/RELEASE_MANAGEMENT.md
> >>
> >>
> >> Thanks,
> >> Sidd
>


Trouble Updating Java artifacts

2018-03-21 Thread Siddharth Teotia
Hi All,

I think the steps mentioned in RM doc for updating java artifacts are
incomplete. I am getting the following error:

Failed to deploy artifacts: Could not transfer artifact
org.apache.arrow:arrow-java-root:pom:0.9.0 from/to apache.releases.https (
https://repository.apache.org/service/local/staging/deploy/maven2): Failed
to transfer file:
https://repository.apache.org/service/local/staging/deploy/maven2/org/apache/arrow/arrow-java-root/0.9.0/arrow-java-root-0.9.0.pom.
Return code is: 401, ReasonPhrase: Unauthorized

Wes had filed https://issues.apache.org/jira/browse/ARROW-2322 to track the
problem. I understand why I see the problem since I am not a PMC. But it
looks like even PMCs are facing the same issue.

Does anyone know what needs to be done?

RM doc:
https://github.com/apache/arrow/blob/master/dev/release/RELEASE_MANAGEMENT.md


Thanks,
Sidd


Re: Arrow Sync Call Started

2018-03-21 Thread Siddharth Teotia
We decided not to have the call since very few people joined possibly due
to timezone confusion.

On Wed, Mar 21, 2018 at 9:04 AM, Wes McKinney <wesmck...@gmail.com> wrote:

> hi folks -- the time zone on the Google calendar invite is set to
> 17:00 GMT which, due to the DST change, is now 1pm Eastern, or 1 hour
> from now. In case there's some confusion, we may need to reschedule
> and make sure we're all agreed on what time zone we're pinned to
>
> On Wed, Mar 21, 2018 at 12:02 PM, Siddharth Teotia <siddha...@dremio.com>
> wrote:
> > https://www.google.com/url?q=https%3A%2F%2Fmeet.google.com%
> 2Fvtm-teks-phx
>


Arrow Sync Call Started

2018-03-21 Thread Siddharth Teotia
https://www.google.com/url?q=https%3A%2F%2Fmeet.google.com%2Fvtm-teks-phx


[jira] [Created] (ARROW-2329) [Website]: 0.9.0 release update

2018-03-20 Thread Siddharth Teotia (JIRA)
Siddharth Teotia created ARROW-2329:
---

 Summary: [Website]: 0.9.0 release update
 Key: ARROW-2329
 URL: https://issues.apache.org/jira/browse/ARROW-2329
 Project: Apache Arrow
  Issue Type: Task
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [VOTE] Release Apache Arrow 0.9.0 (RC2)

2018-03-19 Thread Siddharth Teotia
+1

Verified RC on Mac OS. Things look good.

Thanks
Sidd


On Mar 19, 2018 12:31 AM, "Kouhei Sutou"  wrote:

+1 (binding), tested on Debian GNU/Linux sid with

  * GCC 7.3.0
  * OpenJDK 9.0.4
  * Ruby 2.5.0p0
  * NodeJS 8.9.3

--
kou

In 
  "Re: [VOTE] Release Apache Arrow 0.9.0 (RC2)" on Mon, 19 Mar 2018
00:32:39 +,
  Phillip Cloud  wrote:

> +1 (binding), tested on Arch Linux.
>
> I will verify the RC tomorrow morning (Eastern time) on Windows.
>
> On Sun, Mar 18, 2018 at 9:40 AM Uwe L. Korn  wrote:
>
>> +1 (binding), tested on Ubuntu 16.04
>>
>> > Am 16.03.2018 um 18:41 schrieb Wes McKinney :
>> >
>> > +1 (binding). Ran dev/release/verify-release-candidate.sh on Ubuntu
>> 16.04 with
>> >
>> > * gcc 5.4.0
>> > * JDK8
>> > * Ruby 2.5.0p0
>> > * NodeJS 8.10.0
>> >
>> > I don't have access to a Windows dev machine at the moment; if someone
>> > could verify the RC on Windows that would be very helpful
>> >
>> >> On Fri, Mar 16, 2018 at 1:39 PM, Wes McKinney 
>> wrote:
>> >> Hello all,
>> >>
>> >> I'd like to propose the 1st release candidate (rc2) of Apache Arrow
>> version
>> >> 0.9.0 (rc0 and rc1 were never voted on due to problems I discovered
>> while
>> >> verifying). This is a major release consisting of 258 resolved JIRAs
>> [1].
>> >>
>> >> The source release rc2 is hosted at [2].
>> >>
>> >> This release candidate is based on commit
>> >> c695a5ddc8d26c977b5ecd0c55212e900726953e [3]
>> >>
>> >> The changelog is located at [4].
>> >>
>> >> Please download, verify checksums and signatures, run the unit tests,
>> >> and vote on the release.
>> >>
>> >> The vote will be open for at least 72 hours.
>> >>
>> >> [ ] +1 Release this as Apache Arrow 0.9.0
>> >> [ ] +0
>> >> [ ] -1 Do not release this as Apache Arrow 0.9.0 because...
>> >>
>> >> Thanks,
>> >> Wes
>> >>
>> >> How to validate a release signature:
>> >> https://httpd.apache.org/dev/verification.html
>> >>
>> >> [1]:
>> >>
>> https://issues.apache.org/jira/issues/?jql=project%20%
3D%20ARROW%20AND%20status%20in%20(Resolved%2C%20Closed)%
20AND%20fixVersion%20%3D%200.9.0
>> >> [2]:
>> https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-0.9.0-rc2/
>> >> [3]:
>> https://github.com/apache/arrow/tree/c695a5ddc8d26c977b5ecd0c55212e
900726953e
>> >> [4]:
>> https://github.com/apache/arrow/blob/c695a5ddc8d26c977b5ecd0c55212e
900726953e/CHANGELOG.md
>>
>>


[jira] [Created] (ARROW-2294) Fix splitAndTransfer for variable width vector

2018-03-09 Thread Siddharth Teotia (JIRA)
Siddharth Teotia created ARROW-2294:
---

 Summary: Fix splitAndTransfer for variable width vector
 Key: ARROW-2294
 URL: https://issues.apache.org/jira/browse/ARROW-2294
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java - Vectors
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia


When we splitAndTransfer a vector, the value count to set for the target vector
should be equal to the split length and not the value count of the source vector.

We have seen cases, in operators like FLATTEN and under low memory conditions,
where we end up allocating a lot more memory for the target vector because a
large value was used in setValueCount after the split and transfer is done.
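
A small sketch of the intended behavior (vector types and names here are
illustrative):

// After splitAndTransfer(startIndex, length), size the target by the split
// length, not by the source's value count.
TransferPair pair = source.getTransferPair(allocator);
pair.splitAndTransfer(startIndex, length);
ValueVector target = pair.getTo();
target.setValueCount(length);   // not source.getValueCount()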



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Working towards getting 0.9.0 release candidate up next week

2018-03-08 Thread Siddharth Teotia
Thanks, Wes. Let's shoot for Monday.

On Thu, Mar 8, 2018 at 11:31 AM, Wes McKinney <wesmck...@gmail.com> wrote:

> Since almost all of the items in TODO are C++ or Python issues, I can
> do a final review today to remove anything that isn't absolutely
> necessary for 0.9.0. We have a couple of nasty bugs still in TODO that
> we should try to fix -- in the event that they cannot be fixed, we may
> need to do a 0.9.1 in a week or two. I would suggest we wait to cut
> the RC until Monday to give enough time for these last items to get
> fixes in.
>
> There are some other things that need doing, like updates per changes
> to the ASF checksum policy ARROW-2268.
>
> I can write by EOD today with a status report on the issues in TODO.
>
> I believe you need to be a PMC to undertake the source release process
> prior to the vote -- I am happy to help with this on Monday.
>
> - Wes
>
> On Thu, Mar 8, 2018 at 2:25 PM, Siddharth Teotia <siddha...@dremio.com>
> wrote:
> > All,
> >
> > I plan to get RC out over the weekend or early Monday. Is that fine with
> > everybody?
> >
> > We have 6 items in progress --
> > https://issues.apache.org/jira/projects/ARROW/versions/
> 12341707#release-report-tab-body.
> > How do people feel about completing these JIRAs by tomorrow? I am
> > completely fine with deferring the RC to early next week (Mon/Tue/Wed) if
> > necessary. Just looking for consensus. Also, I suggest that we defer the
> > ones with TODO status. I will do it later today unless I hear otherwise.
> >
> > I was wondering if anyone else is interested in collaborating for the
> > post-release tasks. As per
> > https://github.com/apache/arrow/blob/master/dev/release/
> RELEASE_MANAGEMENT.md,
> > following are the high level post-release tasks. Please let me know if
> you
> > would like to take up something. I have written my name against some of
> > them.
> >
> >
> >- Updating the Arrow Website (Sidd)
> >- Uploading release artifacts to SVN -- looks like PMC karma is needed
> >to do this
> >- Announcing release (Sidd)
> >- Updating website with new API documentation (Sidd)
> >- Updating pip packages for C++ and Python
> >- Updating conda packages for C++ and Python (Sidd)
> >- Updating Java Maven artifacts in Maven central (Sidd)
> >- Release blog post
> >
> > If anything is missing, please add to the above list. It will be helpful
> > for tracking.
> >
> > Thanks,
> > Sidd
> >
> > On Sun, Mar 4, 2018 at 12:34 PM, Wes McKinney <wesmck...@gmail.com>
> wrote:
> >
> >> hey Sidd,
> >>
> >> The Python backlog is still in pretty rough shape. I'd like to see if
> >> we can make an RC by Friday but if not we can defer to Monday/Tuesday
> >> the following week (3/12 or 13). I will trim as much as possible out
> >> of the current backlog to get things down to the essential
> >>
> >> - Wes
> >>
> >> On Sun, Feb 25, 2018 at 11:58 AM, Siddharth Teotia <
> siddha...@dremio.com>
> >> wrote:
> >> > Sounds good.
> >> >
> >> > Thanks
> >> > Sidd
> >> >
> >> > On Feb 24, 2018 6:24 PM, "Wes McKinney" <wesmck...@gmail.com> wrote:
> >> >
> >> > Hi Sidd,
> >> >
> >> > I think we have too many bugs to make an RC this coming week. I
> suggest
> >> we
> >> > defer to the following week.
> >> >
> >> > Thanks
> >> > Wes
> >> >
> >> > On Feb 24, 2018 7:09 PM, "Siddharth Teotia" <siddha...@dremio.com>
> >> wrote:
> >> >
> >> > Hi All,
> >> >
> >> > We currently have 10 issues in progress and PRs are available for 8 of
> >> > them. In interest of getting a release candidate next week, I would
> >> request
> >> > people to review PRs as soon as they can to help make progress and
> close
> >> > out as many JIRAs as we can.
> >> >
> >> > There are 32 issues in TODO list and 25 of them are not yet assigned.
> I
> >> am
> >> > planning to defer some of the unassigned ones later today or
> tomorrow. It
> >> > would be good to soon grab/assign the issues that people want to be
> fixed
> >> > for 0.9.0.
> >> >
> >> > Here is the link to backlog:
> >> > https://issues.apache.org/jira/projects/ARROW/versions/12341707
> >> >
> >> > Thanks,
> >> > Sidd
> >>
>


Re: Working towards getting 0.9.0 release candidate up next week

2018-03-08 Thread Siddharth Teotia
All,

I plan to get RC out over the weekend or early Monday. Is that fine with
everybody?

We have 6 items in progress --
https://issues.apache.org/jira/projects/ARROW/versions/12341707#release-report-tab-body.
How do people feel about completing these JIRAs by tomorrow? I am
completely fine with deferring the RC to early next week (Mon/Tue/Wed) if
necessary. Just looking for consensus. Also, I suggest that we defer the
ones with TODO status. I will do it later today unless I hear otherwise.

I was wondering if anyone else is interested in collaborating for the
post-release tasks. As per
https://github.com/apache/arrow/blob/master/dev/release/RELEASE_MANAGEMENT.md,
following are the high level post-release tasks. Please let me know if you
would like to take up something. I have written my name against some of
them.


   - Updating the Arrow Website (Sidd)
   - Uploading release artifacts to SVN -- looks like PMC karma is needed
   to do this
   - Announcing release (Sidd)
   - Updating website with new API documentation (Sidd)
   - Updating pip packages for C++ and Python
   - Updating conda packages for C++ and Python (Sidd)
   - Updating Java Maven artifacts in Maven central (Sidd)
   - Release blog post

If anything is missing, please add to the above list. It will be helpful
for tracking.

Thanks,
Sidd

On Sun, Mar 4, 2018 at 12:34 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> hey Sidd,
>
> The Python backlog is still in pretty rough shape. I'd like to see if
> we can make an RC by Friday but if not we can defer to Monday/Tuesday
> the following week (3/12 or 13). I will trim as much as possible out
> of the current backlog to get things down to the essential
>
> - Wes
>
> On Sun, Feb 25, 2018 at 11:58 AM, Siddharth Teotia <siddha...@dremio.com>
> wrote:
> > Sounds good.
> >
> > Thanks
> > Sidd
> >
> > On Feb 24, 2018 6:24 PM, "Wes McKinney" <wesmck...@gmail.com> wrote:
> >
> > Hi Sidd,
> >
> > I think we have too many bugs to make an RC this coming week. I suggest
> we
> > defer to the following week.
> >
> > Thanks
> > Wes
> >
> > On Feb 24, 2018 7:09 PM, "Siddharth Teotia" <siddha...@dremio.com>
> wrote:
> >
> > Hi All,
> >
> > We currently have 10 issues in progress and PRs are available for 8 of
> > them. In interest of getting a release candidate next week, I would
> request
> > people to review PRs as soon as they can to help make progress and close
> > out as many JIRAs as we can.
> >
> > There are 32 issues in TODO list and 25 of them are not yet assigned. I
> am
> > planning to defer some of the unassigned ones later today or tomorrow. It
> > would be good to soon grab/assign the issues that people want to be fixed
> > for 0.9.0.
> >
> > Here is the link to backlog:
> > https://issues.apache.org/jira/projects/ARROW/versions/12341707
> >
> > Thanks,
> > Sidd
>


Arrow sync call today - March 7

2018-03-07 Thread Siddharth Teotia
I will be at Strata conference today and won't be able to join the call.

Thanks
Sidd


Re: Working towards getting 0.9.0 release candidate up next week

2018-02-25 Thread Siddharth Teotia
Sounds good.

Thanks
Sidd

On Feb 24, 2018 6:24 PM, "Wes McKinney" <wesmck...@gmail.com> wrote:

Hi Sidd,

I think we have too many bugs to make an RC this coming week. I suggest we
defer to the following week.

Thanks
Wes

On Feb 24, 2018 7:09 PM, "Siddharth Teotia" <siddha...@dremio.com> wrote:

Hi All,

We currently have 10 issues in progress and PRs are available for 8 of
them. In interest of getting a release candidate next week, I would request
people to review PRs as soon as they can to help make progress and close
out as many JIRAs as we can.

There are 32 issues in TODO list and 25 of them are not yet assigned. I am
planning to defer some of the unassigned ones later today or tomorrow. It
would be good to soon grab/assign the issues that people want to be fixed
for 0.9.0.

Here is the link to backlog:
https://issues.apache.org/jira/projects/ARROW/versions/12341707

Thanks,
Sidd


Working towards getting 0.9.0 release candidate up next week

2018-02-24 Thread Siddharth Teotia
Hi All,

We currently have 10 issues in progress and PRs are available for 8 of
them. In interest of getting a release candidate next week, I would request
people to review PRs as soon as they can to help make progress and close
out as many JIRAs as we can.

There are 32 issues in TODO list and 25 of them are not yet assigned. I am
planning to defer some of the unassigned ones later today or tomorrow. It
would be good to soon grab/assign the issues that people want to be fixed
for 0.9.0.

Here is the link to backlog:
https://issues.apache.org/jira/projects/ARROW/versions/12341707

Thanks,
Sidd


Re: Allocating additional memory to the Java Vector objects

2018-02-23 Thread Siddharth Teotia
Yes, explicitly invoking reAlloc() on vectors is generally not needed even
though it is provided as a public API. If the value capacity is not known
upfront or grows dynamically, the setSafe() methods will take care of
internally expanding the buffer to store more data -- here we don't have
any fine-grained control over the new size, as we always double the buffer.
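
A small sketch of that behavior (the allocator setup is illustrative):

// Needs: org.apache.arrow.memory.RootAllocator, org.apache.arrow.vector.IntVector
try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
     IntVector vector = new IntVector("ints", allocator)) {
  vector.setInitialCapacity(8);
  vector.allocateNew();      // allocates space for roughly the requested 8 values
  for (int i = 0; i < 100; i++) {
    vector.setSafe(i, i);    // doubles the buffer internally whenever i reaches the current capacity
  }
  vector.setValueCount(100);
}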

On Fri, Feb 23, 2018 at 4:49 PM, Atul Dambalkar <atul.dambal...@xoriant.com>
wrote:

> Thanks Sidd. Actually, I was looking at the code in base classes for
> Vector implementation, and it does take care of reallocation itself (which
> I was thinking of doing explicitly in the code). Although it uses "reAlloc"
> which allocates double the current size,  for me it works - as I plan to
> start with moderate initial capacity for the vectors.
>
> -Atul
>
> -----Original Message-
> From: Siddharth Teotia [mailto:siddha...@dremio.com]
> Sent: Friday, February 23, 2018 12:14 PM
> To: dev@arrow.apache.org
> Subject: Re: Allocating additional memory to the Java Vector objects
>
> Hi Atul,
>
> Currently there is no way for doing this. The only exposed method of
> expanding the vector buffer is reAlloc() and it allocates a new buffer of
> double the original capacity and copies the old contents into the new
> buffer.
>
> Thanks,
> Sidd
>
> On Fri, Feb 23, 2018 at 12:06 PM, Atul Dambalkar <
> atul.dambal...@xoriant.com
> > wrote:
>
> > Hi,
> >
> > I am creating IntVector in Java as follows - IntVector intVector =
> > (IntVector) vectorSchemaRoot.getVector(name);
> > intVector.setInitialCapacity(100);
> > intVector.allocateNew();
> >
> > Is there a way that I can allocate additional capacity to the same
> > IntVector object by a defined number? Let's say something like -
> > intVector.allocateAdditional(100), which would only add more capacity
> > to the existing buffer without impacting the existing buffer and data.
> >
> > There is an API intVector.reAlloc, but it simply doubles the current
> > allocated memory and not what I intend.
> >
> > Thanks for your inputs,
> > -Atul
> >
> >
>


Re: Allocating additional memory to the Java Vector objects

2018-02-23 Thread Siddharth Teotia
Hi Atul,

Currently there is no way for doing this. The only exposed method of
expanding the vector buffer is reAlloc() and it allocates a new buffer of
double the original capacity and copies the old contents into the new
buffer.

Thanks,
Sidd

On Fri, Feb 23, 2018 at 12:06 PM, Atul Dambalkar  wrote:

> Hi,
>
> I am creating IntVector in Java as follows -
> IntVector intVector = (IntVector) vectorSchemaRoot.getVector(name);
> intVector.setInitialCapacity(100);
> intVector.allocateNew();
>
> Is there a way that I can allocate additional capacity to the same
> IntVector object by a defined number? Let's say something like -
> intVector.allocateAdditional(100), which would only add more capacity to
> the existing buffer without impacting the existing buffer and data.
>
> There is an API intVector.reAlloc, but it simply doubles the current
> allocated memory and not what I intend.
>
> Thanks for your inputs,
> -Atul
>
>


[jira] [Created] (ARROW-2199) Follow up fixes for ARROW-2019. Ensure density driven capacity is never less than 1 and propagate density throughout the vector tree

2018-02-22 Thread Siddharth Teotia (JIRA)
Siddharth Teotia created ARROW-2199:
---

 Summary: Follow up fixes for ARROW-2019. Ensure density driven 
capacity is never less than 1 and propagate density throughout the vector tree
 Key: ARROW-2199
 URL: https://issues.apache.org/jira/browse/ARROW-2199
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java - Vectors
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia
 Fix For: 0.9.0






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2019) Control the memory allocated for inner vector in LIST

2018-01-23 Thread Siddharth Teotia (JIRA)
Siddharth Teotia created ARROW-2019:
---

 Summary: Control the memory allocated for inner vector in LIST
 Key: ARROW-2019
 URL: https://issues.apache.org/jira/browse/ARROW-2019
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia


We have observed cases in our external sort code where the amount of memory 
actually allocated for a record batch sometimes turns out to be more than 
necessary and also more than what was reserved by the operator for special 
purposes. Thus queries fail with OOM.

Usually the way to control the memory allocated by vector.allocateNew() is to call
setInitialCapacity(); the latter modifies the vector state variables which
are then used to allocate memory. However, due to the multiplier of 5 used in
ListVector, we end up asking for more memory than necessary. For example, for
a value count of 4095, we asked for 128KB of memory for an offset buffer of a
VarCharVector for a field which was a list of varchars.

We did ((4095 * 5) + 1) * 4 bytes => ~80KB => 128KB (rounded up to a power-of-2
allocation).

We had earlier made changes to setInitialCapacity() of ListVector when we were 
facing problems with deeply nested lists and decided to use the multiplier only 
for the leaf scalar vector. 

It looks like there is a need for a specialized setInitialCapacity() for 
ListVector where the caller dictates the repeatedness.

Also, there is another bug in setInitialCapacity() where the allocation of 
validity buffer doesn't obey the capacity specified in setInitialCapacity(). 
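
To make the sizing concrete, here is a sketch of the kind of density-driven
calculation being asked for (the helper and its parameters are hypothetical,
not the existing Arrow API):

// Hypothetical: size the inner vector from an expected average list length
// ("density") supplied by the caller, instead of the hard-coded factor of 5.
static int innerValueCapacity(int valueCount, double density) {
  // e.g. valueCount = 4095, density = 1.0 gives ~4095 inner slots instead of 4095 * 5
  return Math.max(1, (int) Math.ceil(valueCount * density));   // never less than 1
}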



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [DRAFT] Apache Arrow board report

2018-01-02 Thread Siddharth Teotia
+1. Thanks, Wes.

On Tue, Jan 2, 2018 at 12:10 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> Would it make sense to mention the other Apache projects using/planning to
> use Arrow?
>
> On Tue, Jan 2, 2018 at 11:31 AM Li Jin <ice.xell...@gmail.com> wrote:
>
> > +1. Thanks Wes!
> >
> > On Tue, Jan 2, 2018 at 11:19 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
> >
> > > +1
> > >
> > > On Tue, Jan 2, 2018, at 4:21 PM, Wes McKinney wrote:
> > > > Here is a draft for this quarter's ASF board report. The Activity /
> > > > Health sections are a bit light on detail, if others would like to
> add
> > > > some things feel free to send them along.
> > > >
> > > > thanks
> > > > Wes
> > > >
> > > > ## Description:
> > > >
> > > > Apache Arrow is a cross-language development platform for in-memory
> > > data. It
> > > > specifies a standardized language-independent columnar memory format
> > for
> > > flat
> > > > and hierarchical data, organized for efficient analytic operations on
> > > modern
> > > > hardware. It also provides computational libraries and zero-copy
> > > streaming
> > > > messaging and interprocess communication. Languages currently
> supported
> > > include
> > > > C, C++, Java, JavaScript, Python, and Ruby.
> > > >
> > > > ## Issues:
> > > >
> > > > There are no issues requiring board attention at this time
> > > >
> > > > ## Activity:
> > > >
> > > > - Steady development activity from previous quarter and continued
> > growth
> > > in
> > > >   contributor base
> > > > - Added 5 new committers
> > > > - First JavaScript-only release (0.2.0) made on December 1
> > > >
> > > > ## Health report:
> > > >
> > > > Project is very healthy with a growing developer and user community.
> > > >
> > > > ## PMC changes:
> > > >
> > > >  - Currently 20 PMC members.
> > > >  - No new PMC members added in the last 3 months
> > > >  - Last PMC addition was Kouhei Sutou on Fri Sep 15 2017
> > > >
> > > > ## Committer base changes:
> > > >
> > > >  - Currently 28 committers.
> > > >  - New commmitters:
> > > > - Phillip Cloud was added as a committer on Tue Oct 03 2017
> > > > - Bryan Cutler was added as a committer on Wed Oct 04 2017
> > > > - Li Jin was added as a committer on Fri Oct 06 2017
> > > > - Paul Taylor was added as a committer on Fri Oct 06 2017
> > > > - Siddharth Teotia was added as a committer on Wed Oct 04 2017
> > > >
> > > > ## Releases:
> > > >
> > > >  - 0.8.0 was released on Sat Dec 16 2017
> > > >  - JS-0.2.0 was released on Fri Dec 01 2017
> > > >
> > > > ## JIRA activity:
> > > >
> > > >  - 323 JIRA tickets created in the last 3 months
> > > >  - 300 JIRA tickets closed/resolved in the last 3 months
> > >
> >
> --
> Twitter: https://twitter.com/holdenkarau
>


[jira] [Created] (ARROW-1946) Add APIs to decimal vector for writing big endian data

2017-12-22 Thread Siddharth Teotia (JIRA)
Siddharth Teotia created ARROW-1946:
---

 Summary: Add APIs to decimal vector for writing big endian data
 Key: ARROW-1946
 URL: https://issues.apache.org/jira/browse/ARROW-1946
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia


We recently moved Dremio to the LE decimal format (similar to Arrow). As part of
that we introduced some APIs in the decimal vector which take big-endian data and
swap the bytes while writing into the ArrowBuf of the decimal vector.

The advantage of these APIs is that the caller does not have to allocate
additional memory and write (and read) the source big-endian value twice:
swapping it into new memory and then using that to write into the vector.

We can directly swap bytes while writing into the vector -- just read once and 
swap while writing.
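
For context, the swap itself is just a reversal of the 16-byte two's
complement value. A plain-Java illustration (not the Arrow API) of the extra
copy the new APIs avoid:

// Without the new APIs, callers first materialize a little-endian copy and
// then write that copy into the vector: an extra allocation plus an extra pass.
static byte[] toLittleEndian(byte[] bigEndian) {
  byte[] swapped = new byte[bigEndian.length];
  for (int i = 0; i < bigEndian.length; i++) {
    swapped[i] = bigEndian[bigEndian.length - 1 - i];
  }
  return swapped;
}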



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1943) Handle setInitialCapacity() for deeply nested lists of lists

2017-12-20 Thread Siddharth Teotia (JIRA)
Siddharth Teotia created ARROW-1943:
---

 Summary: Handle setInitialCapacity() for deeply nested lists of 
lists
 Key: ARROW-1943
 URL: https://issues.apache.org/jira/browse/ARROW-1943
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia


The current implementation of setInitialCapacity() uses a factor of 5 for every
level we go deeper into a list:

So if the schema is LIST(LIST(LIST(LIST(LIST(LIST(LIST(BIGINT))))))) and
we start with an initial capacity of 128, we end up throwing an
OversizedAllocationException from the BigIntVector, because at every level we
increased the capacity by 5x, and by the time we reached the inner scalar that
actually stores the data, we were well over the max size limit per vector (1MB).

We saw this problem in Dremio when we failed to read deeply nested JSON data.
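
To put rough numbers on it (assuming the seven nested LIST levels above, an
initial capacity of 128, and 8-byte BIGINT values): 128 * 5^7 = 10,000,000
inner slots, i.e. roughly 80MB requested for the innermost data buffer alone,
far beyond the ~1MB per-vector limit.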



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1939) Correct links in release 0.8 blog post

2017-12-19 Thread Siddharth Teotia (JIRA)
Siddharth Teotia created ARROW-1939:
---

 Summary: Correct links in release 0.8 blog post
 Key: ARROW-1939
 URL: https://issues.apache.org/jira/browse/ARROW-1939
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia


link to changelog is wrong.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: arrow read/write examples in Java

2017-12-19 Thread Siddharth Teotia
From Arrow 0.8, the second step "Grab the corresponding mutator and
accessor objects by calls to getMutator(), getAccessor()" is not needed. In
fact, it is not even there.

On Tue, Dec 19, 2017 at 10:01 AM, Siddharth Teotia <siddha...@dremio.com>
wrote:

> Hi Animesh,
>
> Firstly I would like to suggest switching over to Arrow 0.8 release asap
> since you are writing JAVA programs and the API usage has changed
> drastically. The new APIs are much simpler with good javadocs and detailed
> internal comments.
>
> If you are writing stop-gap implementation then it is probably fine to
> continue with old version but for long term new API usage is recommended.
>
>
>- Create an instance of the vector. Note that this doesn't allocate
>any memory for the elements in the vector
>- Grab the corresponding mutator and accessor objects by calls to
>getMutator(), getAccessor().
>- Allocate memory
>   - *allocateNew()* - we will allocate memory for default number of
>   elements in the vector. This is applicable to both fixed width and 
> variable
>   width vectors.
>   - *allocateNew(valueCount)* -  for fixed width vectors. Use this
>   method if you have already know the number of elements to store in the
>   vector
>   - *allocateNew(bytes, valueCount)* - for variable width vectors.
>   Use this method if you already know the total size (in bytes) of all the
>   variable width elements you will be storing in the vector. For example, 
> if
>   you are going to store 1024 elements in the vector and the total size
>   across all variable width elements is under 1MB, you can call
>   allocateBytes(1024*1024, 1024)
>- Populate the vector:
>   - Use the *set() or setSafe() *APIs in the mutator interface. From
>   Arrow 0.8 onwards, you can use these APIs directly on the vector 
> instance
>   and mutator/accessor are removed.
>   - The difference between set() and corresponding setSafe() API is
>   that latter internally takes care of expanding the vector's buffer(s) 
> for
>   storing new data.
>   - Each set() API has a corresponding setSafe() API.
>- Do a setValueCount() based on the number of elements you populated
>in the vector.
>- Retrieve elements from the vector:
>   - Use the get(), getObject() APIs in the accessor interface. Again,
>   from Arrow 0.8 onwards you can use these APIs directly.
>- With respect to usage of setInitialCapacity:
>   - Let's say your application always issues calls to allocateNew().
>   It is likely that this will end up over-allocating memory because it
>   assumes a default value count to begin with.
>   - In this case, if you do setInitialCapacity() followed by
>   allocateNew() then latter doesn't do default memory allocation. It does
>   exactly for the value capacity you specified in setInitialCapacity().
>
> I would highly recommend taking a look at https://github.com/apache/
> arrow/blob/master/java/vector/src/test/java/org/apache/
> arrow/vector/TestValueVector.java
> This has lots of examples around populating the vector, retrieving from
> vector, using setInitialCapacity(), using set(), setSafe() methods and a
> combination of them to understand when things can go wrong.
>
> Hopefully this helps. Meanwhile we will try to add some internal README
> for the usage of vectors.
>
> Thanks,
> Siddharth
>
> On Tue, Dec 19, 2017 at 8:55 AM, Emilio Lahr-Vivaz <elahrvi...@ccri.com>
> wrote:
>
>> This has probably changed with the Java code refactor, but I've posted
>> some answers inline, to the best of my understanding.
>>
>> Thanks,
>>
>> Emilio
>>
>> On 12/16/2017 12:17 PM, Animesh Trivedi wrote:
>>
>>> Thanks Wes for you help.
>>>
>>> Based upon some code reading, I managed to code-up a basic working
>>> example.
>>> The code is here:
>>> https://github.com/animeshtrivedi/ArrowExample/tree/master/s
>>> rc/main/java/com/github/animeshtrivedi/arrowexample
>>> .
>>>
>>> However, I do have some questions about the concepts in Arrow
>>>
>>> 1. ArrowBlock is the unit of reading/writing. One ArrowBlock essentially
>>> is
>>> the amount of the data one must hold in-memory at a time. Is my
>>> understanding correct?
>>>
>> yes
>>
>>>
>>> 2. There are Base[Reade/Writer] interfaces as well as Mutator/Accessor
>>> classes in the ValueVector interface - both are implemented by all
>>> supported data types. What is the relationship between these two? or when
>>>

Re: arrow read/write examples in Java

2017-12-19 Thread Siddharth Teotia
Hi Animesh,

Firstly, I would like to suggest switching over to the Arrow 0.8 release asap,
since you are writing Java programs and the API usage has changed
drastically. The new APIs are much simpler, with good javadocs and detailed
internal comments.

If you are writing a stop-gap implementation then it is probably fine to
continue with the old version, but for the long term the new API usage is recommended.


   - Create an instance of the vector. Note that this doesn't allocate any
   memory for the elements in the vector
   - Grab the corresponding mutator and accessor objects by calls to
   getMutator(), getAccessor().
   - Allocate memory
  - *allocateNew()* - allocates memory for a default number of
  elements in the vector. This is applicable to both fixed width and
  variable width vectors.
  - *allocateNew(valueCount)* - for fixed width vectors. Use this
  method if you already know the number of elements to store in the
  vector.
  - *allocateNew(bytes, valueCount)* - for variable width vectors. Use
  this method if you already know the total size (in bytes) of all the
  variable width elements you will be storing in the vector. For
  example, if you are going to store 1024 elements in the vector and the
  total size across all variable width elements is under 1MB, you can
  call allocateNew(1024*1024, 1024).
   - Populate the vector:
  - Use the *set()* or *setSafe()* APIs in the mutator interface. From
  Arrow 0.8 onwards, you can use these APIs directly on the vector
  instance, and mutator/accessor are removed.
  - The difference between set() and the corresponding setSafe() API is
  that the latter internally takes care of expanding the vector's
  buffer(s) for storing new data.
  - Each set() API has a corresponding setSafe() API.
   - Do a setValueCount() based on the number of elements you populated in
   the vector.
   - Retrieve elements from the vector:
  - Use the get(), getObject() APIs in the accessor interface. Again,
  from Arrow 0.8 onwards you can use these APIs directly.
   - With respect to usage of setInitialCapacity:
  - Let's say your application always issues calls to allocateNew(). It
  is likely that this will end up over-allocating memory because it
  assumes a default value count to begin with.
  - In this case, if you do setInitialCapacity() followed by
  allocateNew(), the latter doesn't do the default memory allocation; it
  allocates exactly for the value capacity you specified in
  setInitialCapacity(). (A short end-to-end sketch follows this list.)
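
Putting the steps above together, a minimal end-to-end sketch (class names
and allocator setup are illustrative and may differ slightly between
releases):

// Needs: org.apache.arrow.memory.RootAllocator, org.apache.arrow.vector.IntVector
try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
     IntVector vector = new IntVector("values", allocator)) {
  // create + size + allocate
  vector.setInitialCapacity(1024);
  vector.allocateNew();

  // populate; setSafe() expands the buffers if we ever exceed the capacity
  for (int i = 0; i < 1024; i++) {
    if (i % 10 == 0) {
      vector.setNull(i);        // every 10th entry is NULL
    } else {
      vector.setSafe(i, i);
    }
  }
  vector.setValueCount(1024);

  // retrieve
  for (int i = 0; i < vector.getValueCount(); i++) {
    Integer value = vector.isNull(i) ? null : vector.get(i);   // or getObject(i)
  }
}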

I would highly recommend taking a look at
https://github.com/apache/arrow/blob/master/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java
This has lots of examples around populating the vector, retrieving from the
vector, using setInitialCapacity(), and using the set() and setSafe()
methods, including combinations of them, to understand when things can go wrong.

Hopefully this helps. Meanwhile, we will try to add an internal README on
the usage of vectors.

Thanks,
Siddharth

On Tue, Dec 19, 2017 at 8:55 AM, Emilio Lahr-Vivaz 
wrote:

> This has probably changed with the Java code refactor, but I've posted
> some answers inline, to the best of my understanding.
>
> Thanks,
>
> Emilio
>
> On 12/16/2017 12:17 PM, Animesh Trivedi wrote:
>
>> Thanks Wes for you help.
>>
>> Based upon some code reading, I managed to code-up a basic working
>> example.
>> The code is here:
>> https://github.com/animeshtrivedi/ArrowExample/tree/master/
>> src/main/java/com/github/animeshtrivedi/arrowexample
>> .
>>
>> However, I do have some questions about the concepts in Arrow
>>
>> 1. ArrowBlock is the unit of reading/writing. One ArrowBlock essentially
>> is
>> the amount of the data one must hold in-memory at a time. Is my
>> understanding correct?
>>
> yes
>
>>
>> 2. There are Base[Reade/Writer] interfaces as well as Mutator/Accessor
>> classes in the ValueVector interface - both are implemented by all
>> supported data types. What is the relationship between these two? or when
>> is one suppose to use one over other. I only use Mutator/Accessor classes
>> in my code.
>>
> The write/reader interfaces are parallel implementations that make some
> things easier, but don't encompass all available functionality (for
> example, fixed size lists, nested lists, some dictionary operations, etc).
> However, you should be able to accomplish everything using
> mutators/accessors.
>
>>
>> 3. What are the "safe" varient functions in the Mutator's code? I could
>> not
>> understand what they meant to achieve.
>>
> The safe methods ensure that the vector is large enough to set the value.
> You can use the unsafe versions if you know that your vector has already
> allocated enough space for your data.
>
>> 4. What are MinorTypes?
>>
> Minor types are a representation of the different vector types. I believe
> they are being de-emphasized in favor of FieldTypes, as minor types don't
> contain enough 

Re: Confirming Release Owners

2017-12-17 Thread Siddharth Teotia
Conda is done -- I updated arrow-cpp-feedstock and Uwe took care of
parquet-cpp and pyarrow.

On Sun, Dec 17, 2017 at 4:14 PM, Jacques Nadeau  wrote:

> Wes: Post to Dist, Upload Maven artifacts, Send Announce
> Jacques: Update website/docs
> Sidd (with Help from Uwe): Update conda
> Bryan: Update pip
>
> Did I miss anything? How is everyone trending?
>


[jira] [Created] (ARROW-1922) Blog post on recent improvements/changes in JAVA Vectors

2017-12-13 Thread Siddharth Teotia (JIRA)
Siddharth Teotia created ARROW-1922:
---

 Summary: Blog post on recent improvements/changes in JAVA Vectors
 Key: ARROW-1922
 URL: https://issues.apache.org/jira/browse/ARROW-1922
 Project: Apache Arrow
  Issue Type: Task
  Components: Java - Vectors
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1876) Transfer validity vector buffer data word at a time (currently we do byte at a time)

2017-12-01 Thread Siddharth Teotia (JIRA)
Siddharth Teotia created ARROW-1876:
---

 Summary: Transfer validity vector buffer data word at a time 
(currently we do byte at a time)
 Key: ARROW-1876
 URL: https://issues.apache.org/jira/browse/ARROW-1876
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia
Priority: Minor


We should split and transfer validity buffer contents word at a time. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


General Suggestions and Request regarding recent JAVA changes.

2017-11-29 Thread Siddharth Teotia
Folks,

Over the last couple of weeks, we have had several changes (both merged and
in the pipeline) as follow-up work after ARROW-1463 was merged.

I feel that refactoring suggestions are being proposed on the fly while the
developer is already in the middle of the code changes, when it is too late
to weigh in on them.

It doesn't give the reviewer enough time to understand the rationale behind
the proposed changes, assess their downstream impact, and, most importantly,
get a clear idea of all the changes being implemented, so that downstream
consumers know what to expect when they next rebase.

Two sets of such follow-up changes are already merged to master. For the
ones in the pipeline, I request that people send out a doc or spec
highlighting what we are proposing to change and the rationale, similar to
how the requirements and design spec for ARROW-1463 were sent out prior to
making any code changes.

Thanks,
Siddharth


[jira] [Created] (ARROW-1813) Enforce checkstyle failure in JAVA build and fix all checkstyle

2017-11-14 Thread Siddharth Teotia (JIRA)
Siddharth Teotia created ARROW-1813:
---

 Summary: Enforce  checkstyle failure in JAVA build and fix all 
checkstyle
 Key: ARROW-1813
 URL: https://issues.apache.org/jira/browse/ARROW-1813
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Update on ARROW-1463 - Request for merging java-vector-refactor into master

2017-11-13 Thread Siddharth Teotia
Functional and Performance testing has been completed with Dremio. We have
seen overall improvement in TPCH numbers. We had about 8000 regression
tests and 12000 unit tests.

I would like to start the process of merging the java-vector-refactor branch
into master. The branch has 2 patches with 95% of the code changes, and a third
patch will be available in a couple of hours -- minor bug fixes found as part
of the testing in Dremio.

I would also like to request that we not merge any orthogonal set of changes to
the refactor branch at this point. That would require grabbing the changes
again and re-testing downstream, and thus increase the timeline for merging
this into master. It has become slightly difficult to maintain local branches
with such a volume of changes and work across patches.

Follow-up items in the order of priority:

   - buffer consolidation (sooner, but probably not this Arrow release) --
   possibly the highest-priority item in the effort to reduce heap usage. This
   was touched upon in the design spec, but it wasn't feasible to squeeze it
   into the work already done on the refactor branch.


   - fate of non-nullable fixed and var width vectors

Thanks,
Siddharth


[jira] [Created] (ARROW-1807) [JAVA] Reduce Heap Usage (Phase 3): consolidate buffers

2017-11-13 Thread Siddharth Teotia (JIRA)
Siddharth Teotia created ARROW-1807:
---

 Summary: [JAVA] Reduce Heap Usage (Phase 3): consolidate buffers
 Key: ARROW-1807
 URL: https://issues.apache.org/jira/browse/ARROW-1807
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia


Consolidate buffers for reducing the volume of objects and heap usage

<validity + data> => single buffer for fixed width
<validity + offsets> => single buffer for var width, list vector



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: [DISCUSS] readerIndex/writerIndex in Java vector refactor

2017-11-09 Thread Siddharth Teotia
ReaderIndex and WriterIndex are important when we get the buffers (for
sending over the wire). We get the buffers from one or more vectors, build
a compound buffer, and slice it on the other end when reconstructing the
vectors. The writer index helps demarcate the exact end point of the last
written data.

When I started to write the patches for the refactoring, I wasn't quite sure
about their use, but I later learned it and set the indexes appropriately in
the required places in the vector code.



On Thu, Nov 9, 2017 at 9:28 AM, Li Jin <ice.xell...@gmail.com> wrote:

> Hi All,
>
> I am reading Java vector refactor code and come cross
> readerIndex/writerIndex on ArrowBuf. This issue has been brought up by
> Siddharth
> Teotia earlier but I might have missed the discussion so what to clarify.
>
> My understanding is that the current implementation in java refactor branch
> ignore reader/writerIndex on ArrowBuf. None of the arrow code sets or uses
> reader/writerIndex on ArrowBuf.
>
> I'd like to get thoughts from people regarding this issue:
> (1) Ignoring readerIndex/writerIndex is good because ...
> (2) Ignoring readerIndex/writerIndex is bad because...
>
> The before refactor code - it seems somewhat inconsistent with this matter
> - there are code that uses reader/writerIndex but the "set" method doesn't
> seem to advance writerIndex.
>


Re: Arrow sync today

2017-11-01 Thread Siddharth Teotia
There was no meeting today.

On Wed, Nov 1, 2017 at 10:14 AM, Li Jin  wrote:

> I wasn't able to join the chat room so not sure what's going on. Did we
> have the meeting?
> On Wed, Nov 1, 2017 at 12:55 PM Bryan Cutler  wrote:
>
> > Sorry, I won't be able to make today's call either.  I have been working
> on
> > ARROW-1047 to make a generic message interface for stream format in Java.
> > I'm just curious about where we are at with the Java refactoring. Thanks!
> >
> > On Nov 1, 2017 9:07 AM, "Uwe L. Korn"  wrote:
> >
> > > Due to a public holiday in Germany today I'm also unable to join. Short
> > > heads-up from me : i'm still working on packaging problems with the
> > Python
> > > wheels and on selective categorical conversion for Arrow-> Pandas.
> > >
> > > Uwe
> > >
> > > > Am 01.11.2017 um 16:43 schrieb Wes McKinney :
> > > >
> > > > I am not able to attend today’s Arrow sync. Others are free to meet
> and
> > > > relay notes to the mailing list
> > >
> > >
> >
>


Re: Arrow sync today

2017-11-01 Thread Siddharth Teotia
I have joined the meeting here -- https://meet.google.com/vtm-teks-phx
I don't see anybody. Can someone please send out the correct link?

On Wed, Nov 1, 2017 at 8:43 AM, Wes McKinney  wrote:

> I am not able to attend today’s Arrow sync. Others are free to meet and
> relay notes to the mailing list
>


Re: Update on ARROW-1463, related subtasks and plan for testing and merging

2017-10-13 Thread Siddharth Teotia
Okay, sounds good.

On Fri, Oct 13, 2017 at 2:50 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> It is fine to have not-completely-working states in the refactor
> branch. I recommend do whatever is the most expedient thing to help
> with making progress.
>
> - Wes
>
> On Fri, Oct 13, 2017 at 5:42 PM, Siddharth Teotia <siddha...@dremio.com>
> wrote:
> > Li,
> >
> > I think there is some confusion. Are you suggesting merging into "java
> > vector refactor" branch or the master? Is it fine to merge stuff on the
> > former branch even though few things are broken (around 10 tests) ? If
> this
> > is allowed, I can do some cleanup (some documentation, some TODOs
> suggested
> > by you and Brian) and we can merge the current patch by EOD or over the
> > weekend.
> >
> > Is this okay? Since we are going to iterate over this branch and not
> going
> > to push anything to master until new code is stable, we are probably
> good.
> >
> > Thanks,
> > Siddhath
> >
> >
> >
> > On Fri, Oct 13, 2017 at 12:17 PM, Li Jin <ice.xell...@gmail.com> wrote:
> >
> >> Siddharth,
> >>
> >> Regarding rename:
> >> Yes this can be done later.
> >>
> >> Tests:
> >> I agree having code like https://github.com/apache/
> >> arrow/pull/1164/files#diff-0876c9a0005d1dbaea321ea8d39d79ae is hard to
> >> maintain even temporarily. I am not sure what's the best way to resolve
> >> test failure wrt removing of the accessor/mutator from the vectors.
> Maybe
> >> we can have change the template the create non-accessor/mutator
> >> getter/setters and also remove acessor/mutator in the test for it to
> pass?
> >> What do you think is the easiest?
> >>
> >> Reader/Writer:
> >> Yes we can address this later.
> >>
> >> Apologies if I seem to add more work for merging https://github.com/
> >> apache/arrow/pull/1164, that's not my intention, I think the PR looks
> >> good -
> >> just want to bring up some major design decisions so people can comment
> and
> >> discuss.
> >>
> >> Li
> >>
> >>
> >>
> >>
> >> On Fri, Oct 13, 2017 at 2:37 PM, Siddharth Teotia <siddha...@dremio.com
> >
> >> wrote:
> >>
> >> > I am not quite sure of the need to rename the vectors. Why do we need
> to
> >> > rename? This would first require us to remove all the vectors
> generated
> >> by
> >> > FixedValueVectors.java as they are non-nullable scalar vectors.
> Removing
> >> > non-nullable vectors is one of the goals, but it can be done once the
> new
> >> > infrastructure is properly setup?
> >> >
> >> > In order to merge the existing patch, I first need to address some
> >> (10-15)
> >> > failures -- few of them are correctness issues w.r.t
> >> TestVectorUnloadLoad,
> >> > TestArrowFile and rest all are related to getMutator(), getAccessor()
> >> > throwing UnsupportedOperationException. This is why I was saying
> earlier
> >> > that I will end up doing a lot of rework by writing redundant code
> where
> >> > (if vector instanceof NullableInt or vector instanceof
> NullableVarChar)
> >> we
> >> > don't use the mutator/accessor and for other vectors we use it for the
> >> > current patch. These if conditions are getting complicated with ugly
> type
> >> > casting in some parts of the code --
> >> > https://github.com/apache/arrow/pull/1164/files#diff-
> >> > 0876c9a0005d1dbaea321ea8d39d79ae
> >> >
> >> > So I thought we can probably implement other vectors (remaining
> scalars,
> >> > map and list) where no vector has mutator/accessor and then for every
> >> > ValueVector, we can remove all calls to getMutator(), getAccessor() as
> >> > opposed to doing them selectively ---
> >> > https://github.com/apache/arrow/pull/1164/files#diff-
> >> > e9273a7b3b35ff7f40f101dc2cf95242
> >> >
> >> > I will try to address these failures by EOD and see if this patch can
> be
> >> > merged first.
> >> >
> >> > Regarding readers and writers, can we address them subsequently?
> >> >
> >> > On Fri, Oct 13, 2017 at 11:03 AM, Li Jin <ice.xell...@gmail.com>
> wrote:
> >> >
> >> > > Siddharth,
> >> > >
> >> > > Thanks for the update.

Re: Update on ARROW-1463, related subtasks and plan for testing and merging

2017-10-13 Thread Siddharth Teotia
Li,

I think there is some confusion. Are you suggesting merging into the "java
vector refactor" branch or into master? Is it fine to merge stuff into the
former branch even though a few things are broken (around 10 tests)? If this
is allowed, I can do some cleanup (some documentation, some TODOs suggested
by you and Brian) and we can merge the current patch by EOD or over the
weekend.

Is this okay? Since we are going to iterate on this branch and not going
to push anything to master until the new code is stable, we are probably good.

Thanks,
Siddharth



On Fri, Oct 13, 2017 at 12:17 PM, Li Jin <ice.xell...@gmail.com> wrote:

> Siddharth,
>
> Regarding rename:
> Yes this can be done later.
>
> Tests:
> I agree having code like https://github.com/apache/
> arrow/pull/1164/files#diff-0876c9a0005d1dbaea321ea8d39d79ae is hard to
> maintain even temporarily. I am not sure what's the best way to resolve
> test failures w.r.t. removing the accessor/mutator from the vectors. Maybe
> we can change the template to create non-accessor/mutator
> getters/setters and also remove the accessor/mutator usage in the tests so they pass?
> What do you think is the easiest?
>
> Reader/Writer:
> Yes we can address this later.
>
> Apologies if I seem to add more work for merging https://github.com/
> apache/arrow/pull/1164, that's not my intention, I think the PR looks
> good -
> just want to bring up some major design decisions so people can comment and
> discuss.
>
> Li
>
>
>
>
> On Fri, Oct 13, 2017 at 2:37 PM, Siddharth Teotia <siddha...@dremio.com>
> wrote:
>
> > I am not quite sure of the need to rename the vectors. Why do we need to
> > rename? This would first require us to remove all the vectors generated
> by
> > FixedValueVectors.java as they are non-nullable scalar vectors. Removing
> > non-nullable vectors is one of the goals, but it can be done once the new
> > infrastructure is properly setup?
> >
> > In order to merge the existing patch, I first need to address some
> (10-15)
> > failures -- few of them are correctness issues w.r.t
> TestVectorUnloadLoad,
> > TestArrowFile and rest all are related to getMutator(), getAccessor()
> > throwing UnsupportedOperationException. This is why I was saying earlier
> > that I will end up doing a lot of rework by writing redundant code where
> > (if vector instanceof NullableInt or vector instanceof NullableVarChar)
> we
> > don't use the mutator/accessor and for other vectors we use it for the
> > current patch. These if conditions are getting complicated with ugly type
> > casting in some parts of the code --
> > https://github.com/apache/arrow/pull/1164/files#diff-
> > 0876c9a0005d1dbaea321ea8d39d79ae
> >
> > So I thought we can probably implement other vectors (remaining scalars,
> > map and list) where no vector has mutator/accessor and then for every
> > ValueVector, we can remove all calls to getMutator(), getAccessor() as
> > opposed to doing them selectively ---
> > https://github.com/apache/arrow/pull/1164/files#diff-
> > e9273a7b3b35ff7f40f101dc2cf95242
> >
> > I will try to address these failures by EOD and see if this patch can be
> > merged first.
> >
> > Regarding readers and writers, can we address them subsequently?
> >
> > On Fri, Oct 13, 2017 at 11:03 AM, Li Jin <ice.xell...@gmail.com> wrote:
> >
> > > Siddharth,
> > >
> > > Thanks for the update. I think it's fine to move forward with more
> > vectors,
> > > but in the mean time, I think we should also prioritize to merge
> > > https://github.com/apache/arrow/pull/1164, here are a few comments
> needs
> > > to
> > > be addressed.
> > >
> > > (1) Backward-compatibility:
> > > I think there is no way to maintain backward compability as the new
> > vector
> > > classes will be renamed, but want to confirm we are OK with this
> > decision.
> > > We also think the disruption on the Spark side are OK as Spark's use
> case
> > > is simple and Bryan and I can take care of the code change.
> > >
> > > (2) Reader/writer classes:
> > > How does the reader/writer classes interact with the new and legacy
> > vector
> > > classes:
> > >
> > > Discussion: https://github.com/apache/arrow/pull/1164#discussion_
> > > r144074264
> > >
> > > My thoughts are:
> > > (1) ArrowReader classes should only return new vector classes
> > > (2) ArrowWriter classes should only work with new vector classes
> > > (3) To read/write legacy vectors, we can use adapters to turn legac

Re: Update on ARROW-1463, related subtasks and plan for testing and merging

2017-10-13 Thread Siddharth Teotia
I am not quite sure of the need to rename the vectors. Why do we need to
rename? This would first require us to remove all the vectors generated by
FixedValueVectors.java as they are non-nullable scalar vectors. Removing
non-nullable vectors is one of the goals, but it can be done once the new
infrastructure is properly setup?

In order to merge the existing patch, I first need to address some (10-15)
failures -- a few of them are correctness issues w.r.t. TestVectorUnloadLoad
and TestArrowFile, and the rest are related to getMutator()/getAccessor()
throwing UnsupportedOperationException. This is why I was saying earlier
that I will end up doing a lot of rework for the current patch by writing
redundant code: if the vector is an instance of NullableInt or NullableVarChar
we skip the mutator/accessor, and for all other vectors we still use it.
These if conditions are getting complicated, with ugly type
casting in some parts of the code --
https://github.com/apache/arrow/pull/1164/files#diff-0876c9a0005d1dbaea321ea8d39d79ae

So I thought we could probably implement the other vectors (remaining scalars,
map and list) without mutator/accessor first, and then remove all calls to
getMutator()/getAccessor() for every ValueVector in one pass, as
opposed to doing it selectively ---
https://github.com/apache/arrow/pull/1164/files#diff-e9273a7b3b35ff7f40f101dc2cf95242

I will try to address these failures by EOD and see if this patch can be
merged first.

Regarding readers and writers, can we address them subsequently?
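
(For reference, a minimal sketch of the interim branching pattern described above. Everything below is a hypothetical stand-in defined inline, not the actual Arrow classes; it only illustrates why shared call sites get cluttered until every vector drops its accessor/mutator.)

class MixedApiExample {
  static class NewIntVector {                       // stand-in for a refactored vector
    int get(int index) { return index * 2; }
  }

  static class LegacyIntVector {                    // stand-in for a pre-refactor vector
    class Accessor { int get(int index) { return index * 2; } }
    Accessor getAccessor() { return new Accessor(); }
  }

  // Every shared call site needs this kind of instanceof branching until the
  // accessor/mutator methods are removed from all vectors in one pass.
  static int readValue(Object vector, int index) {
    if (vector instanceof NewIntVector) {
      return ((NewIntVector) vector).get(index);    // new API: direct get()
    }
    return ((LegacyIntVector) vector).getAccessor().get(index);   // old API: via accessor
  }

  public static void main(String[] args) {
    System.out.println(readValue(new NewIntVector(), 3));    // prints 6
    System.out.println(readValue(new LegacyIntVector(), 3)); // prints 6
  }
}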

On Fri, Oct 13, 2017 at 11:03 AM, Li Jin <ice.xell...@gmail.com> wrote:

> Siddharth,
>
> Thanks for the update. I think it's fine to move forward with more vectors,
> but in the mean time, I think we should also prioritize to merge
> https://github.com/apache/arrow/pull/1164, here are a few comments needs
> to
> be addressed.
>
> (1) Backward-compatibility:
> I think there is no way to maintain backward compability as the new vector
> classes will be renamed, but want to confirm we are OK with this decision.
> We also think the disruption on the Spark side are OK as Spark's use case
> is simple and Bryan and I can take care of the code change.
>
> (2) Reader/writer classes:
> How does the reader/writer classes interact with the new and legacy vector
> classes:
>
> Discussion: https://github.com/apache/arrow/pull/1164#discussion_
> r144074264
>
> My thoughts are:
> (1) ArrowReader classes should only return new vector classes
> (2) ArrowWriter classes should only work with new vector classes
> (3) To read/write legacy vectors, we can use adapters to turn legacy
> vectors to new vectors (zero-copy, as the underlying buffers should be
> transferred directly)
>
> Jacques also has a few comments, I don't know if they have been addressed.
>
> For other comments, I think we can add TODO and do it later. I think we can
> merge this PR if we address (1) (2) above.
>
> Comments?
>
>
>
>
>
>
> On Fri, Oct 13, 2017 at 12:36 PM, Siddharth Teotia <siddha...@dremio.com>
> wrote:
>
> > The patch that I have put up https://github.com/apache/arrow/pull/1198
> > seems to be in a reasonable state. We are now working off a different
> > branch "java vector refactor".
> >
> > Now that we have the basic structure,  in order to make quick forward
> > progress, I would like to go ahead and do for other types (FLOAT, BIGINT
> > etc), list, map and create their legacy
> > counter parts -- doing them in subsequent patches is requiring me to
> write
> > some duplicate code and redundant if conditions in code that expects all
> > the vectors to have mutator/accessor.
> >
> > Is that fine? Just wanted to check with people and ensure there aren't
> any
> > major concerns.
> >
> > The feedback on the PR (original one for master
> > https://github.com/apache/arrow/pull/1164) has been really good -- some
> of
> > the comments are yet to be addressed and we jointly decided to address
> few
> > things (like Minor Type etc) after the refactoring has been done.
> >
> > On the testing front, as far as the correctness is concerned, I have two
> > failures in TestArrowFile and TestValueVector. I have added some more
> tests
> > too.
> >
> >
> >
> >
> >
> > On Thu, Oct 12, 2017 at 2:18 PM, Siddharth Teotia <siddha...@dremio.com>
> > wrote:
> >
> > > Yes, that is the intention. Good that we all are on the same page. I
> will
> > > move the PR https://github.com/apache/arrow/pull/1164 to new branch.
> > >
> > > On Thu, Oct 12, 2017 at 11:20 AM, Li Jin <ice.xell...@gmail.com>
> wrote:
> > >
> > >> To make clear, I thi

Re: Update on ARROW-1463, related subtasks and plan for testing and merging

2017-10-13 Thread Siddharth Teotia
The patch that I have put up https://github.com/apache/arrow/pull/1198
seems to be in a reasonable state. We are now working off a different
branch "java vector refactor".

Now that we have the basic structure, in order to make quick forward
progress, I would like to go ahead and do the same for the other types (FLOAT,
BIGINT, etc.), list, and map, and create their legacy
counterparts -- doing them in subsequent patches requires me to write
some duplicate code and redundant if conditions wherever the code expects all
the vectors to have a mutator/accessor.

Is that fine? Just wanted to check with people and ensure there aren't any
major concerns.

The feedback on the PR (original one for master
https://github.com/apache/arrow/pull/1164) has been really good -- some of
the comments are yet to be addressed, and we jointly decided to address a few
things (like MinorType, etc.) after the refactoring has been done.

On the testing front, as far as the correctness is concerned, I have two
failures in TestArrowFile and TestValueVector. I have added some more tests
too.





On Thu, Oct 12, 2017 at 2:18 PM, Siddharth Teotia <siddha...@dremio.com>
wrote:

> Yes, that is the intention. Good that we all are on the same page. I will
> move the PR https://github.com/apache/arrow/pull/1164 to new branch.
>
> On Thu, Oct 12, 2017 at 11:20 AM, Li Jin <ice.xell...@gmail.com> wrote:
>
>> To make clear, I think it's fine to have Legacy Vectors in 0.8 as a
>> deprecated API.
>>
>> On Thu, Oct 12, 2017 at 2:19 PM, Li Jin <ice.xell...@gmail.com> wrote:
>>
>> > Siddharth,
>> >
>> > For working off a branch, Wes has created https://github.com/apache/
>> > arrow/tree/java-vector-refactor that we can submit PR to.
>> >
>> > For Legacy vectors, I think it's fine because it's really just a
>> migration
>> > path to help Dremio to migrate to the new vectors. I don't think other
>> > users, i.e., Spark will use the Legacy vector class. Bryan and I will
>> just
>> > migrate Spark to new vectors directly because Spark's use of Arrow is
>> very
>> > simple.
>> >
>> >
>> >
>> > On Thu, Oct 12, 2017 at 2:08 PM, Siddharth Teotia <siddha...@dremio.com
>> >
>> > wrote:
>> >
>> >> Thanks Bryan and Li.
>> >>
>> >> Yes, the goal is to get this (and the subsequent patches) merged to the
>> >> new
>> >> branch. Once it is stabilized from different aspects, we can move to
>> >> master. I am not sure of the exact mechanics when we work off a
>> different
>> >> project branch and not master.
>> >>
>> >> Does that sound good?
>> >>
>> >> Regarding compatibility, are we suggesting that let's not create Legacy
>> >> Nullable vectors at all? The initial thoughts were to generate Legacy
>> >> vectors from NullableValueVectors template and these vectors are
>> >> mutator/accessor based (in today's world). Internally each operation
>> will
>> >> be delegated to new vectors (non code generated).
>> >>
>> >> On Thu, Oct 12, 2017 at 10:58 AM, Bryan Cutler <cutl...@gmail.com>
>> wrote:
>> >>
>> >> > Thanks for the update Siddharth.  From the Spark side of this, I
>> >> definitely
>> >> > want to try to upgrade to the latest Arrow before the Spark 2.3
>> release
>> >> but
>> >> > if it the refactor is too disruptive then others might get squeamish
>> >> about
>> >> > upgrading.  On the other hand, I don't think we should hold back on
>> >> > refactoring for compatibility sake and the way it's looking now
>> trying
>> >> to
>> >> > be backwards-compatible will be too much of a pain.  I will try to
>> >> figure
>> >> > out the timeline for Spark 2.3 and what the feeling is for upgrading
>> >> > Arrow.  Can we hold off on merging this to master for now and just
>> work
>> >> out
>> >> > of the separate branch until we can get a better feeling for the
>> impact?
>> >> >
>> >> > Bryan
>> >> >
>> >> > On Wed, Oct 11, 2017 at 8:19 AM, Li Jin <ice.xell...@gmail.com>
>> wrote:
>> >> >
>> >> > > Hi Siddharth,
>> >> > >
>> >> > > Thanks for the update. This looks good.
>> >> > >
>> >> > > A few thoughts:
>> >> > >
>> >> > > *Compatibility:*
>> >> > > It sounds l

Re: Update on ARROW-1463, related subtasks and plan for testing and merging

2017-10-12 Thread Siddharth Teotia
Yes, that is the intention. Good that we are all on the same page. I will
move the PR https://github.com/apache/arrow/pull/1164 to the new branch.

On Thu, Oct 12, 2017 at 11:20 AM, Li Jin <ice.xell...@gmail.com> wrote:

> To make clear, I think it's fine to have Legacy Vectors in 0.8 as a
> deprecated API.
>
> On Thu, Oct 12, 2017 at 2:19 PM, Li Jin <ice.xell...@gmail.com> wrote:
>
> > Siddharth,
> >
> > For working off a branch, Wes has created https://github.com/apache/
> > arrow/tree/java-vector-refactor that we can submit PR to.
> >
> > For Legacy vectors, I think it's fine because it's really just a
> migration
> > path to help Dremio to migrate to the new vectors. I don't think other
> > users, i.e., Spark will use the Legacy vector class. Bryan and I will
> just
> > migrate Spark to new vectors directly because Spark's use of Arrow is
> very
> > simple.
> >
> >
> >
> > On Thu, Oct 12, 2017 at 2:08 PM, Siddharth Teotia <siddha...@dremio.com>
> > wrote:
> >
> >> Thanks Bryan and Li.
> >>
> >> Yes, the goal is to get this (and the subsequent patches) merged to the
> >> new
> >> branch. Once it is stabilized from different aspects, we can move to
> >> master. I am not sure of the exact mechanics when we work off a
> different
> >> project branch and not master.
> >>
> >> Does that sound good?
> >>
> >> Regarding compatibility, are we suggesting that let's not create Legacy
> >> Nullable vectors at all? The initial thoughts were to generate Legacy
> >> vectors from NullableValueVectors template and these vectors are
> >> mutator/accessor based (in today's world). Internally each operation
> will
> >> be delegated to new vectors (non code generated).
> >>
> >> On Thu, Oct 12, 2017 at 10:58 AM, Bryan Cutler <cutl...@gmail.com>
> wrote:
> >>
> >> > Thanks for the update Siddharth.  From the Spark side of this, I
> >> definitely
> >> > want to try to upgrade to the latest Arrow before the Spark 2.3
> release
> >> but
> >> > if it the refactor is too disruptive then others might get squeamish
> >> about
> >> > upgrading.  On the other hand, I don't think we should hold back on
> >> > refactoring for compatibility sake and the way it's looking now trying
> >> to
> >> > be backwards-compatible will be too much of a pain.  I will try to
> >> figure
> >> > out the timeline for Spark 2.3 and what the feeling is for upgrading
> >> > Arrow.  Can we hold off on merging this to master for now and just
> work
> >> out
> >> > of the separate branch until we can get a better feeling for the
> impact?
> >> >
> >> > Bryan
> >> >
> >> > On Wed, Oct 11, 2017 at 8:19 AM, Li Jin <ice.xell...@gmail.com>
> wrote:
> >> >
> >> > > Hi Siddharth,
> >> > >
> >> > > Thanks for the update. This looks good.
> >> > >
> >> > > A few thoughts:
> >> > >
> >> > > *Compatibility:*
> >> > > It sounds like we will introduce some back-compatibility with the
> new
> >> > > Vector class. At this point I think our main Java users should be
> >> Spark
> >> > and
> >> > > Dremio, is this right?
> >> > >
> >> > >
> >> > >- For Spark:
> >> > >
> >> > > It seems fine since Spark uses just the basic functionality of
> Vector
> >> > > classes and the existing code should work with the new Vector
> classes,
> >> > > maybe even without any code change on the Spark side.
> >> > >
> >> > >
> >> > >- For Dremio:
> >> > >
> >> > > Sounds like you are already taking care of this by introducing the
> >> > > LegacyVector classes.
> >> > >
> >> > >
> >> > > *Testing:*
> >> > >
> >> > >- Spark Integration Tests:
> >> > >
> >> > > Bryan and I can help with integration test with Spark. I think the
> >> target
> >> > > timeline for Spark 2.3 release is some time in mid Nov (Bryan please
> >> > > correct me if I am wrong).
> >> > >
> >> > > I will take a look at the PR today.
> >> > >
> >> > >
> >> > >

Re: Update on ARROW-1463, related subtasks and plan for testing and merging

2017-10-12 Thread Siddharth Teotia
Thanks Bryan and Li.

Yes, the goal is to get this (and the subsequent patches) merged into the new
branch. Once it has stabilized from different aspects, we can move to
master. I am not sure of the exact mechanics when we work off a different
project branch rather than master.

Does that sound good?

Regarding compatibility, are we suggesting that we not create Legacy
Nullable vectors at all? The initial thought was to generate Legacy
vectors from the NullableValueVectors template; these vectors are
mutator/accessor based (in today's world), and internally each operation will
be delegated to the new (non code-generated) vectors.

On Thu, Oct 12, 2017 at 10:58 AM, Bryan Cutler <cutl...@gmail.com> wrote:

> Thanks for the update Siddharth.  From the Spark side of this, I definitely
> want to try to upgrade to the latest Arrow before the Spark 2.3 release, but
> if the refactor is too disruptive then others might get squeamish about
> upgrading.  On the other hand, I don't think we should hold back on
> refactoring for compatibility sake and the way it's looking now trying to
> be backwards-compatible will be too much of a pain.  I will try to figure
> out the timeline for Spark 2.3 and what the feeling is for upgrading
> Arrow.  Can we hold off on merging this to master for now and just work out
> of the separate branch until we can get a better feeling for the impact?
>
> Bryan
>
> On Wed, Oct 11, 2017 at 8:19 AM, Li Jin <ice.xell...@gmail.com> wrote:
>
> > Hi Siddharth,
> >
> > Thanks for the update. This looks good.
> >
> > A few thoughts:
> >
> > *Compatibility:*
> > It sounds like we will introduce some back-compatibility with the new
> > Vector class. At this point I think our main Java users should be Spark
> and
> > Dremio, is this right?
> >
> >
> >- For Spark:
> >
> > It seems fine since Spark uses just the basic functionality of Vector
> > classes and the existing code should work with the new Vector classes,
> > maybe even without any code change on the Spark side.
> >
> >
> >- For Dremio:
> >
> > Sounds like you are already taking care of this by introducing the
> > LegacyVector classes.
> >
> >
> > *Testing:*
> >
> >- Spark Integration Tests:
> >
> > Bryan and I can help with integration test with Spark. I think the target
> > timeline for Spark 2.3 release is some time in mid Nov (Bryan please
> > correct me if I am wrong).
> >
> > I will take a look at the PR today.
> >
> >
> >
> >
> >
> >
> > On Tue, Oct 10, 2017 at 4:29 PM, Siddharth Teotia <siddha...@dremio.com>
> > wrote:
> >
> > > Hi All,
> > >
> > > I wanted to update everyone on state of this mini-project:
> > >
> > >
> > >- Requirements document and initial design proposal were sent out to
> > the
> > >community for review and we have received some good feedback. All
> > > required
> > >docs are attached with corresponding JIRAs.
> > >
> > >
> > >- The initial prototype is in a reasonable state (code-complete).
> You
> > >can see the PR here - https://github.com/apache/arrow/pull/1164
> > >
> > >
> > >- The prototype has code changes for the new hierarchy, abstract
> > >interfaces for fixed width and variable width vectors and concrete
> > >implementation of NullableIntVector and NullableVarCharVector.
> > >
> > >
> > > Plan for testing and integrating into existing infrastructure:
> > >
> > >
> > >- My initial thoughts are that this particular patch will require a
> > lot
> > >of testing, reviews etc since the foundation of rest of the
> > > implementation
> > >more or less depends on how the APIs are flushed out here.
> > >
> > >
> > >- So the goal is to get this properly tested and merged into master
> > >first.
> > >
> > >
> > >- The idea is to slowly deprecate and remove the existing vectors in
> > >stages. In this patch itself, we change the existing
> > >NullableValueVectors.java template to generate
> LegacyNullableIntVector
> > > and
> > >LegacyNullableVarCharVector. Each operation on these vectors will
> > > delegate
> > >to the corresponding NullableIntVector and NullableVarCharVector
> that
> > > are
> > >newly implemented.
> > >
> > >
> > >- This achieves two goals w.r.t testing:
> > >

Update on ARROW-1463, related subtasks and plan for testing and merging

2017-10-10 Thread Siddharth Teotia
Hi All,

I wanted to update everyone on state of this mini-project:


   - Requirements document and initial design proposal were sent out to the
   community for review and we have received some good feedback. All required
   docs are attached with corresponding JIRAs.


   - The initial prototype is in a reasonable state (code-complete). You
   can see the PR here - https://github.com/apache/arrow/pull/1164


   - The prototype has code changes for the new hierarchy, abstract
   interfaces for fixed width and variable width vectors and concrete
   implementation of NullableIntVector and NullableVarCharVector.


Plan for testing and integrating into existing infrastructure:


   - My initial thoughts are that this particular patch will require a lot
   of testing, reviews, etc., since the foundation of the rest of the implementation
   more or less depends on how the APIs are fleshed out here.


   - So the goal is to get this properly tested and merged into master
   first.


   - The idea is to slowly deprecate and remove the existing vectors in
   stages. In this patch itself, we change the existing
   NullableValueVectors.java template to generate LegacyNullableIntVector and
   LegacyNullableVarCharVector. Each operation on these vectors will delegate
   to the corresponding NullableIntVector and NullableVarCharVector that are
   newly implemented (see the delegation sketch after this list).


   - This achieves two goals w.r.t testing:


   - Firstly, our existing Java unit tests will automatically exercise the
   newly written code and its APIs (the API names have not changed) for the
   NullableInt and NullableVarChar vectors.


   - Secondly, if we rebase Dremio on top of Arrow master and replace all
   references to NullableIntVector and NullableVarCharVector with their
   Legacy counterparts, things should still work.


   - After this patch gets merged, we can do the following work in multiple
   patches:
   - Write concrete implementations for the rest of the nullable types --
   FLOAT4, FLOAT8, BIGINT, VARBINARY, etc.


   - Write additional tests (definitely needed but the first goal is to
  make sure existing tests are not broken).


   - Ensure NullableValueVectors template generates Legacy vectors and each
  operation is merely a delegation to the API in new implementation.


   - In the next Arrow release, remove all Legacy vectors and
  NullableValueVectors template since we will have the implementation for
  each type that passes existing tests.


   - I am currently inspecting the newly written code and making changes to
  the template to generate Legacy vector types for Nullable Int
and Nullable
  VarChar and delegating the operations. The changes should be available in
  the PR in a couple of hours.
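
(A rough sketch of the delegation idea referenced in the list above. The class names below are illustrative stand-ins; the real legacy classes would still be generated from the NullableValueVectors.java template.)

class LegacyDelegationSketch {
  // Stand-in for the newly implemented, non-generated vector.
  static class NewIntVector {
    void allocateNew(int valueCount) { /* allocate validity + data buffers */ }
    void setSafe(int index, int value) { /* write value, growing if needed */ }
    int get(int index) { return 0; /* read value */ }
  }

  // Stand-in for the generated LegacyNullableIntVector: it keeps the old API
  // surface but forwards every call to the new implementation.
  static class LegacyNullableIntVector {
    private final NewIntVector delegate = new NewIntVector();

    void allocateNew(int valueCount) { delegate.allocateNew(valueCount); }
    void setSafe(int index, int value) { delegate.setSafe(index, value); }
    int get(int index) { return delegate.get(index); }
  }
}

This way the existing tests keep calling the old class names while actually exercising the new code paths.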


I am wondering if there are any other ideas around testing, merging etc.
Please feel free to reply here or comment on the PR.

I would appreciate it if people could take the time to review the code in the PR --
especially the abstract classes BaseNullableFixedWidth and
BaseNullableVariableWidth. Writing concrete implementations for other types
will be much less of a hassle if these abstract classes have proper code.

Thanks,
Siddharth


[jira] [Created] (ARROW-1655) Add Scale and Precision to ValueVectorTypes.tdd for Decimals

2017-10-05 Thread Siddharth Teotia (JIRA)
Siddharth Teotia created ARROW-1655:
---

 Summary: Add Scale and Precision to ValueVectorTypes.tdd for 
Decimals
 Key: ARROW-1655
 URL: https://issues.apache.org/jira/browse/ARROW-1655
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: Arrow sync call tomorrow 4 October @ 16:00 UTC

2017-10-04 Thread Siddharth Teotia
I am out for a doctor's appointment. I may have to miss it this time, or I
might join a bit late.

On Oct 4, 2017 7:16 AM, "Wes McKinney"  wrote:

Heimir has offered to create a Hangout that should accommodate up to
25 participants:

https://plus.google.com/hangouts/_/mojotech.com/array-sync?hceid=
aGVpbWlyQG1vam90ZWNoLmNvbQ.5cjurieavbah0d8pc6kbum5k6v

Let's give this a shot.

Thanks
Wes

On Tue, Oct 3, 2017 at 12:36 PM, Wes McKinney  wrote:
> hi folks,
>
> We will be having our biweekly call again tomorrow. All are welcome to
> join; meeting notes will be published on the mailing list for further
> discussion and transparency.
>
> Can someone assist me with setting up a Google Meet call so that we
> can accommodate more than 10 attendees (we maxed out on a normal
> Google Hangout last time)?
>
> Thank you
> Wes


Re: ARROW-1463: SubTask ARROW-1472: Design updated Value Vector hierarchy.

2017-10-03 Thread Siddharth Teotia
Anyway, I will create a PR in some time for the WIP prototype. I think once
people eyeball the code there, we may have a consensus.


On Tue, Oct 3, 2017 at 3:53 PM, Siddharth Teotia <siddha...@dremio.com>
wrote:

> Li,
>
> This is exactly what I was referring to in my previous email. I think if
> we have the opportunity (which we have now), we should see if complex
> template code like this can be removed unless it is absolutely necessary.
> This is why I was suggesting if we can just not have any code generated
> classes.
>
> Thanks,
> Siddharth
>
> On Tue, Oct 3, 2017 at 3:03 PM, Li Jin <ice.xell...@gmail.com> wrote:
>
>> Siddharth,
>>
>> Thanks for the update. Without really sit down and do the prototype, my
>> opinions can be wrong. But,
>>
>> I think a lot of the complication are code like this:
>>
>> <#elseif minor.class == "Decimal">
>> public void get(int index, ${minor.class}Holder holder) {
>> holder.start = index * ${type.width};
>> holder.buffer = data;
>> holder.scale = scale;
>> holder.precision = precision;
>> }
>>
>> The fact that we have to have a special block for Decimal is eyesore to
>> me.
>> And it's not really shared with any other classes.
>>
>> Things like these are not great but might be ok?
>>
>> <#if type.width == 4>
>> public long getTwoAsLong(int index) {
>> return data.getLong(index * ${type.width});
>> }
>>
>> 
>>
>> Sorry I couldn't provide too much useful feedback without digging into the
>> template, but this is my general feeling about these templates - too many
>> "if"s tied to types like "Interval", "Decimal", "Timestamp"
>>
>>
>>
>> On Tue, Oct 3, 2017 at 3:59 PM, Siddharth Teotia <siddha...@dremio.com>
>> wrote:
>>
>> > I am in the middle of a simple prototype that has the basic
>> implementation
>> > of BaseFixedWidthVector, FixedValueVectorsPrototype.java (template) to
>> > generate a simple IntVector using the proposal mentioned in the
>> document.
>> >
>> > I have realized that even though the LOCs in existing templates are
>> reduced
>> > by 30-40% since bunch of common/basic functionality is moved to super
>> class
>> > BaseFixedWidthVector, the major source of pain (giant and complex if
>> > conditions) associated with code generation is in accessor and mutator
>> > which is still part of templates.
>> >
>> > I am trying to err on the side of not using templates at all since I
>> feel
>> > there is not much of gain from this refactoring project if the code in
>> > templates is still complex and requires regular addition/modification
>> when
>> > adding new types. We are probably better off writing multiple sub
>> classes
>> > (with duplicate code as applicable)
>> >
>> > Thoughts?
>> >
>> > I can create a PR from this prototype code once it in reasonable shape
>> for
>> > review but was wondering if people have any opinion.
>> >
>> > Thanks,
>> > Sidd
>> >
>> > On Tue, Oct 3, 2017 at 3:16 AM, Siddharth Teotia <siddha...@dremio.com>
>> > wrote:
>> >
>> > > Hi All,
>> > >
>> > > You should have received an invitation to edit the following document.
>> > > Please feel free to add comments or additional content.
>> > >
>> > > https://docs.google.com/document/d/1rl0PK5OnbQAnFUrhd4bQPtP0u7930
>> > > sBKKaiyggOY7t4/edit
>> > >
>> > > Let me know if the document is not editable.
>> > >
>> > > Thanks,
>> > > Siddharth
>> > >
>> >
>>
>
>


Re: ARROW-1463: SubTask ARROW-1472: Design updated Value Vector hierarchy.

2017-10-03 Thread Siddharth Teotia
Li,

This is exactly what I was referring to in my previous email. I think, since we
have the opportunity (which we have now), we should see if complex template
code like this can be removed unless it is absolutely necessary. This is
why I was suggesting that we just not have any code-generated classes.

Thanks,
Siddharth

On Tue, Oct 3, 2017 at 3:03 PM, Li Jin <ice.xell...@gmail.com> wrote:

> Siddharth,
>
> Thanks for the update. Without really sit down and do the prototype, my
> opinions can be wrong. But,
>
> I think a lot of the complication are code like this:
>
> <#elseif minor.class == "Decimal">
> public void get(int index, ${minor.class}Holder holder) {
> holder.start = index * ${type.width};
> holder.buffer = data;
> holder.scale = scale;
> holder.precision = precision;
> }
>
> The fact that we have to have a special block for Decimal is eyesore to me.
> And it's not really shared with any other classes.
>
> Things like these are not great but might be ok?
>
> <#if type.width == 4>
> public long getTwoAsLong(int index) {
> return data.getLong(index * ${type.width});
> }
>
> 
>
> Sorry I couldn't provide too much useful feedback without digging into the
> template, but this is my general feeling about these templates - too many
> "if"s tied to types like "Interval", "Decimal", "Timestamp"
>
>
>
> On Tue, Oct 3, 2017 at 3:59 PM, Siddharth Teotia <siddha...@dremio.com>
> wrote:
>
> > I am in the middle of a simple prototype that has the basic
> implementation
> > of BaseFixedWidthVector, FixedValueVectorsPrototype.java (template) to
> > generate a simple IntVector using the proposal mentioned in the document.
> >
> > I have realized that even though the LOCs in existing templates are
> reduced
> > by 30-40% since bunch of common/basic functionality is moved to super
> class
> > BaseFixedWidthVector, the major source of pain (giant and complex if
> > conditions) associated with code generation is in accessor and mutator
> > which is still part of templates.
> >
> > I am trying to err on the side of not using templates at all since I feel
> > there is not much of gain from this refactoring project if the code in
> > templates is still complex and requires regular addition/modification
> when
> > adding new types. We are probably better off writing multiple sub classes
> > (with duplicate code as applicable)
> >
> > Thoughts?
> >
> > I can create a PR from this prototype code once it in reasonable shape
> for
> > review but was wondering if people have any opinion.
> >
> > Thanks,
> > Sidd
> >
> > On Tue, Oct 3, 2017 at 3:16 AM, Siddharth Teotia <siddha...@dremio.com>
> > wrote:
> >
> > > Hi All,
> > >
> > > You should have received an invitation to edit the following document.
> > > Please feel free to add comments or additional content.
> > >
> > > https://docs.google.com/document/d/1rl0PK5OnbQAnFUrhd4bQPtP0u7930
> > > sBKKaiyggOY7t4/edit
> > >
> > > Let me know if the document is not editable.
> > >
> > > Thanks,
> > > Siddharth
> > >
> >
>


Re: ARROW-1463: SubTask ARROW-1472: Design updated Value Vector hierarchy.

2017-10-03 Thread Siddharth Teotia
I am in the middle of a simple prototype that has the basic implementation
of BaseFixedWidthVector, FixedValueVectorsPrototype.java (template) to
generate a simple IntVector using the proposal mentioned in the document.

I have realized that even though the LOCs in the existing templates are reduced
by 30-40%, since a bunch of common/basic functionality is moved to the super class
BaseFixedWidthVector, the major source of pain (giant and complex if
conditions) associated with code generation is in the accessor and mutator,
which are still part of the templates.

I am trying to err on the side of not using templates at all, since I feel
there is not much gain from this refactoring project if the code in the
templates is still complex and requires regular additions/modifications when
adding new types. We are probably better off writing multiple subclasses
(with duplicated code where applicable).

Thoughts?

I can create a PR from this prototype code once it is in reasonable shape for
review, but I was wondering if people have any opinions.

Thanks,
Sidd
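
(To make the template-free option above concrete, a minimal hand-written sketch of what one per-type class could look like. The class name is hypothetical and java.nio buffers stand in for ArrowBuf; no accessor/mutator objects, validity tracked as one bit per value.)

import java.nio.ByteBuffer;

public class SimpleIntVectorSketch {
  private final ByteBuffer data;       // stand-in for the data buffer
  private final ByteBuffer validity;   // stand-in for the validity buffer

  public SimpleIntVectorSketch(int valueCount) {
    data = ByteBuffer.allocate(valueCount * 4);           // 4 bytes per value
    validity = ByteBuffer.allocate((valueCount + 7) / 8); // 1 bit per value
  }

  public void set(int index, int value) {
    data.putInt(index * 4, value);
    validity.put(index / 8, (byte) (validity.get(index / 8) | (1 << (index % 8))));
  }

  public boolean isNull(int index) {
    return ((validity.get(index / 8) >> (index % 8)) & 1) == 0;
  }

  public int get(int index) {
    return data.getInt(index * 4);
  }
}

The per-type duplication is only the "int / 4 bytes" specifics; everything else could live in a shared super class like the proposed BaseFixedWidthVector.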

On Tue, Oct 3, 2017 at 3:16 AM, Siddharth Teotia <siddha...@dremio.com>
wrote:

> Hi All,
>
> You should have received an invitation to edit the following document.
> Please feel free to add comments or additional content.
>
> https://docs.google.com/document/d/1rl0PK5OnbQAnFUrhd4bQPtP0u7930
> sBKKaiyggOY7t4/edit
>
> Let me know if the document is not editable.
>
> Thanks,
> Siddharth
>


Re: [ANNOUNCE] New Arrow committers: Phillip Cloud and Bryan Cutler

2017-10-03 Thread Siddharth Teotia
Congrats Phillip and Bryan :)

On Tue, Oct 3, 2017 at 11:57 AM, Holden Karau  wrote:

> Congrats to the both of you :) Really excited to see the Arrow project
> continue to grow :)
>
> On Tue, Oct 3, 2017 at 10:24 AM Julian Hyde  wrote:
>
> > Congratulations and welcome, Philip and Bryan!
> >
> > > On Oct 3, 2017, at 5:27 AM, Wes McKinney  wrote:
> > >
> > > On behalf of the Arrow PMC, I'm pleased to announce that Phillip Cloud
> > > and Bryan Cutler have been invited to be Arrow committers.
> > >
> > > We are grateful for your contributions to the project and look forward
> > > to growing the community together.
> > >
> > > Welcome, Phillip and Bryan, and congrats!
> > >
> > > - Wes
> >
> > --
> Twitter: https://twitter.com/holdenkarau
>


Re: ARROW-1463: Subtask ARROW-1471 Requirements for Value Vector Updates

2017-10-01 Thread Siddharth Teotia
Thanks Li.

I will wait for a day to receive any further comments before marking
ARROW-1471 as resolved -- not to imply that requirements are set "in stone"
and nothing can be changed. This is just to make sure we are making
progress.

Thanks,
Siddharth

On Sun, Oct 1, 2017 at 12:28 PM, Li Jin <ice.xell...@gmail.com> wrote:

> Siddharth,
>
> The requirement doc looks good to me. Thanks for putting this together.
>
> Li
>
> On Sun, Oct 1, 2017 at 1:20 PM, Siddharth Teotia <siddha...@dremio.com>
> wrote:
>
> > Hi All,
> >
> > I am assuming that there are no more requirements to be added/suggested.
> If
> > so, we can proceed with designing and prototyping as mentioned in other
> > subtasks.
> >
> > Recently we discovered requirement w.r.t heap usage in vector/memory
> code.
> > So that is something that I need to note down in the requirement doc. I
> > have already created corresponding JIRAs and linked to ARROW-1463.
> >
> > Thanks,
> > Siddharth
> >
> > On Thu, Sep 21, 2017 at 4:11 PM, Bryan Cutler <cutl...@gmail.com> wrote:
> >
> > > Thanks for doing this Siddharth!  The link in your email isn't
> clickable
> > > for me, so for us folks too lazy to cut and paste I'll try adding it
> > again
> > >
> > > https://docs.google.com/document/d/1ysZ76zritBDwkeQz3C6-
> > > vhQwD32jEXd1kUF4T936G1U/edit
> > >
> > > On Thu, Sep 21, 2017 at 2:18 PM, Siddharth Teotia <
> siddha...@dremio.com>
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > You should have received an invitation to edit the following
> document.
> > > > Please feel free to add comments or additional content.
> > > >
> > > > https://docs.google.com/document/d/1ysZ76zritBDwkeQz3C6-
> > > > vhQwD32jEXd1kUF4T936G1U/edit?usp=sharing
> > > >
> > > > Thanks,
> > > > Siddharth
> > > >
> > >
> >
>


Re: ARROW-1463: Subtask ARROW-1471 Requirements for Value Vector Updates

2017-10-01 Thread Siddharth Teotia
Hi All,

I am assuming that there are no more requirements to be added/suggested. If
so, we can proceed with designing and prototyping as mentioned in other
subtasks.

Recently we discovered a requirement w.r.t. heap usage in the vector/memory code,
so that is something I need to note down in the requirements doc. I
have already created the corresponding JIRAs and linked them to ARROW-1463.

Thanks,
Siddharth

On Thu, Sep 21, 2017 at 4:11 PM, Bryan Cutler <cutl...@gmail.com> wrote:

> Thanks for doing this Siddharth!  The link in your email isn't clickable
> for me, so for us folks too lazy to cut and paste I'll try adding it again
>
> https://docs.google.com/document/d/1ysZ76zritBDwkeQz3C6-
> vhQwD32jEXd1kUF4T936G1U/edit
>
> On Thu, Sep 21, 2017 at 2:18 PM, Siddharth Teotia <siddha...@dremio.com>
> wrote:
>
> > Hi All,
> >
> > You should have received an invitation to edit the following document.
> > Please feel free to add comments or additional content.
> >
> > https://docs.google.com/document/d/1ysZ76zritBDwkeQz3C6-
> > vhQwD32jEXd1kUF4T936G1U/edit?usp=sharing
> >
> > Thanks,
> > Siddharth
> >
>


[jira] [Created] (ARROW-1621) Reduce Heap Usage per Vector

2017-09-27 Thread Siddharth Teotia (JIRA)
Siddharth Teotia created ARROW-1621:
---

 Summary: Reduce Heap Usage per Vector
 Key: ARROW-1621
 URL: https://issues.apache.org/jira/browse/ARROW-1621
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java - Memory, Java - Vectors
Reporter: Siddharth Teotia


https://docs.google.com/document/d/1MU-ah_bBHIxXNrd7SkwewGCOOexkXJ7cgKaCis5f-PI/edit



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1618) See if the heap usage in vectors can be reduced.

2017-09-27 Thread Siddharth Teotia (JIRA)
Siddharth Teotia created ARROW-1618:
---

 Summary: See if the heap usage in vectors can be reduced.
 Key: ARROW-1618
 URL: https://issues.apache.org/jira/browse/ARROW-1618
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java - Memory, Java - Vectors
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia


We have seen in our tests that there is some scope of improvement as far as the 
number of objects and/or sizing of some data structures is concerned.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


ARROW-1463: Subtask ARROW-1471 Requirements for Value Vector Updates

2017-09-21 Thread Siddharth Teotia
Hi All,

You should have received an invitation to edit the following document.
Please feel free to add comments or additional content.

https://docs.google.com/document/d/1ysZ76zritBDwkeQz3C6-
vhQwD32jEXd1kUF4T936G1U/edit?usp=sharing

Thanks,
Siddharth


[jira] [Created] (ARROW-1553) Implement setInitialCapacity for MapWriter and pass on this capacity during lazy creation of child vectors

2017-09-18 Thread Siddharth Teotia (JIRA)
Siddharth Teotia created ARROW-1553:
---

 Summary: Implement setInitialCapacity for MapWriter and pass on 
this capacity during lazy creation of child vectors
 Key: ARROW-1553
 URL: https://issues.apache.org/jira/browse/ARROW-1553
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java - Vectors
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1547) Fix 8x memory over-allocation in BitVector

2017-09-17 Thread Siddharth Teotia (JIRA)
Siddharth Teotia created ARROW-1547:
---

 Summary: Fix 8x memory over-allocation in BitVector
 Key: ARROW-1547
 URL: https://issues.apache.org/jira/browse/ARROW-1547
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java - Vectors
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia


Typically there are 3 ways of specifying the amount of memory needed for 
vectors.

CASE (1) allocateNew() -- here the application doesn't really specify the size 
of memory or value count. Each vector type has a default value count (4096) and 
therefore a default size (in bytes) is used in such cases.

For example, for a 4 byte fixed-width vector, we will allocate 32KB of memory 
for a call to allocateNew().

CASE (2) setInitialCapacity(count) followed by allocateNew() - In this case 
also the application doesn't specify the value count or size in allocateNew(). 
However, the call to setInitialCapacity() dictates the amount of memory the 
subsequent call to allocateNew() will allocate.

For example, we can do setInitialCapacity(1024) and the call to allocateNew() 
will allocate 4KB of memory for the 4 byte fixed-width vector.

CASE (3) allocateNew(count) - The application is specific about requirements.

For nullable vectors, the above calls also allocate the memory for validity 
vector.

The problem is that BitVector uses a default memory size of 4096 bytes. In 
other words, we allocate a vector with a capacity of 4096*8 values.

In the default case (as explained above), the vector types have a value count 
of 4096, so we need only 4096 bits (512 bytes) in the bit vector, not 4096 
bytes.

This happens in CASE (1), where the application depends on the default memory 
allocation. In such cases, the buffer for the bit vector is 8x larger than 
actually needed.
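
(Illustrative arithmetic for the 8x figure above -- plain Java, names are hypothetical, not the Arrow allocation code.)

public class BitVectorSizingExample {
  public static void main(String[] args) {
    int defaultValueCount = 4096;                    // default number of values to track
    int bytesNeeded = (defaultValueCount + 7) / 8;   // 1 validity bit per value -> 512 bytes
    int bytesAllocated = 4096;                       // the byte count used as the default today
    System.out.println("needed = " + bytesNeeded + " bytes, allocated = "
        + bytesAllocated + " bytes, over-allocation = "
        + (bytesAllocated / bytesNeeded) + "x");     // prints 8x
  }
}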



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1533) As part of buffer transfer (transferTo function), we should transfer the state for realloc

2017-09-12 Thread Siddharth Teotia (JIRA)
Siddharth Teotia created ARROW-1533:
---

 Summary: As part of buffer transfer (transferTo function), we 
should transfer the state for realloc
 Key: ARROW-1533
 URL: https://issues.apache.org/jira/browse/ARROW-1533
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java - Vectors
Reporter: Siddharth Teotia
Assignee: Siddharth Teotia


We recently encountered a problem when we were trying to add JSON files with 
complex schema as datasets.

Initially we started with a Float8Vector with default memory allocation of 
(4096 * 8) 32KB.
Went through several iterations of setSafe() to trigger a realloc() from 32KB 
to 64KB.
Another round of setSafe() calls to trigger a realloc() from 64KB to 128KB

After that we encountered a BigInt and promoted our vector to UnionVector.

This required us to create a UnionVector with BigIntVector and Float8Vector. 
The latter required us to transfer the Float8Vector we were earlier working 
with to the Float8Vector inside the Union.

As part of transferTo(), the target Float8Vector got all the ArrowBuf state 
(capacity, buffer contents) etc transferred from the source vector.

Later, a realloc was triggered on the Float8Vector inside the UnionVector.

The computation inside realloc() to determine the amount of memory to be 
reallocated goes wrong since it makes the decision based on allocateSizeInBytes 
-- although this vector was created as part of transfer() from a 128KB source 
vector, allocateSizeInBytes is still at the initial/default value of 32KB.

We end up allocating a 64KB buffer, attempt to copy 128KB over 64KB, and seg 
fault when invoking setBytes().

There is a wrong assumption in realloc() that allocateSizeInBytes is always 
equal to data.capacity(). The particular scenario described above shows how 
this assumption can go wrong.
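
(A self-contained toy model of the sequence above. Plain Java arrays stand in for ArrowBuf, and the field/method names are illustrative, not the actual vector code; it only reproduces the bookkeeping mistake.)

import java.util.Arrays;

public class ReallocStateBug {
  static class ToyVector {
    byte[] data = new byte[32 * 1024];          // "default" 32KB allocation
    int allocationSizeInBytes = 32 * 1024;      // cached state consulted by realloc()

    void realloc() {
      int newSize = allocationSizeInBytes * 2;  // wrong whenever data.length != allocationSizeInBytes
      data = Arrays.copyOf(data, newSize);      // the toy truncates here; the real code copies the
      allocationSizeInBytes = newSize;          // old (larger) contents into the too-small buffer
    }

    void transferTo(ToyVector target) {
      target.data = this.data;                  // the buffer moves ...
      this.data = new byte[0];                  // ... but allocationSizeInBytes is not transferred
    }
  }

  public static void main(String[] args) {
    ToyVector source = new ToyVector();
    source.realloc();                           // 32KB -> 64KB
    source.realloc();                           // 64KB -> 128KB

    ToyVector insideUnion = new ToyVector();    // freshly created target inside the union
    source.transferTo(insideUnion);
    System.out.println(insideUnion.data.length + " vs cached " + insideUnion.allocationSizeInBytes);
    // prints 131072 vs cached 32768

    insideUnion.realloc();                      // doubles the cached 32KB, not the real 128KB
    System.out.println("after realloc: " + insideUnion.data.length);   // 65536 -- smaller than before
  }
}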



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: Travis CI problems

2017-09-07 Thread Siddharth Teotia
Thanks a lot. I will take a stab at finding the root cause of these
failures.

On Sep 7, 2017 6:57 PM, "Wes McKinney" <wesmck...@gmail.com> wrote:

> I submitted a patch to allow failures in this entry in the build
> matrix for the time being
>
> https://github.com/apache/arrow/pull/1064
>
> On Thu, Sep 7, 2017 at 9:08 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> > That's from the JDK9 build which was only added on August 17
> > (https://github.com/apache/arrow/commit/4ef7c898bb82cd3513e0ad3d80730e
> 29ebaeb60e#diff-93f725a07423fe1c889f448b33d21f46).
> > The flakiness started sometime in the last 3 days.
> >
> > If Laurent or someone with Java background could investigate the cause
> > and either fix or disable entry in the Travis build matrix, that would
> > be great.
> >
> > Thanks
> >
> > On Thu, Sep 7, 2017 at 7:18 PM, Siddharth Teotia <siddha...@dremio.com>
> wrote:
> >> Is anyone else seeing the following failures in Travis CI build? I am
> >> seeing these problems for PR https://github.com/apache/arrow/pull/1052
> >>
> >> I looked at the raw log and nothing seems to indicate problems w.r.t
> code
> >> changes.
> >>
> >> travis_time:end:082db3a8:start=1504813701624387520,
> finish=1504813701628120703,duration=3733183
> >>  [0Ktravis_fold:end:before_install
> >>  [0Ktravis_time:start:0ca495aa
> >>  [0K$ $TRAVIS_BUILD_DIR/ci/travis_script_java.sh
> >> ~/build/apache/arrow/java ~/build/apache/arrow
> >> The JAVA_HOME environment variable is not defined correctly
> >> This environment variable is needed to run this program
> >> NB: JAVA_HOME should point to a JDK not a JRE
>


Travis CI problems

2017-09-07 Thread Siddharth Teotia
Is anyone else seeing the following failures in Travis CI build? I am
seeing these problems for PR https://github.com/apache/arrow/pull/1052

I looked at the raw log and nothing seems to indicate problems w.r.t code
changes.

travis_time:end:082db3a8:start=1504813701624387520,finish=1504813701628120703,duration=3733183
 [0Ktravis_fold:end:before_install
 [0Ktravis_time:start:0ca495aa
 [0K$ $TRAVIS_BUILD_DIR/ci/travis_script_java.sh
~/build/apache/arrow/java ~/build/apache/arrow
The JAVA_HOME environment variable is not defined correctly
This environment variable is needed to run this program
NB: JAVA_HOME should point to a JDK not a JRE


ARROW-1463 subtask assignments - https://issues.apache.org/jira/browse/ARROW-1463

2017-09-07 Thread Siddharth Teotia
Hi All,

I am wondering if anyone is interested in working on sub-tasks for
ARROW-1463. Please feel free to grab the child JIRAs.

Thanks,
Siddharth


[jira] [Created] (ARROW-1478) clear should release the buffer only if the buffer is not NULL

2017-09-06 Thread SIDDHARTH TEOTIA (JIRA)
SIDDHARTH TEOTIA created ARROW-1478:
---

 Summary: clear should release the buffer only if the buffer is not 
NULL
 Key: ARROW-1478
 URL: https://issues.apache.org/jira/browse/ARROW-1478
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java - Vectors
Reporter: SIDDHARTH TEOTIA
Assignee: SIDDHARTH TEOTIA


In some cases we use a fake allocator in Dremio for the purpose of field 
materialization only. The buffers of the underlying vectors are not allocated. 
The fake allocator is a simple implementation of the BufferAllocator interface 
where almost every method throws UnsupportedOperationException and methods like 
getEmpty() return null.

It is more like a pass-through mechanism that allows us to instantiate a vector 
using a non-functional allocator, since the constructors in the vector code 
don't allow the allocator itself to be null.

Portions of code where we have this scenario are generic in nature and so have 
typical methods like close() / clear() which underneath invoke the 
corresponding methods on vectors.

The clear() method in BaseDataValueVector releases the data buffer without 
checking if the buffer is null, and that's where callers hit an NPE.

We don't see such problems in the Arrow unit tests. My guess is that when a 
vector is instantiated, the buffer is still probably a valid reference returned 
through the allocator.getEmpty() call in the constructor of BaseDataValueVector.
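
(A minimal sketch of the proposed guard. The buffer type and field name below are stand-ins, not the actual BaseDataValueVector code.)

class NullSafeClearSketch {
  interface Buf { void release(); }   // stand-in for the real buffer type

  private Buf data;                   // can be null when a "fake" allocator was used

  public void clear() {
    if (data != null) {               // release only if a buffer was actually obtained
      data.release();
      data = null;
    }
    // reset any remaining state here as before
  }
}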






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1467) Fix reset() and allocateNew() in Nullable Value Vectors template

2017-09-05 Thread SIDDHARTH TEOTIA (JIRA)
SIDDHARTH TEOTIA created ARROW-1467:
---

 Summary: Fix reset() and allocateNew() in Nullable Value Vectors 
template
 Key: ARROW-1467
 URL: https://issues.apache.org/jira/browse/ARROW-1467
 Project: Apache Arrow
  Issue Type: Bug
Reporter: SIDDHARTH TEOTIA
Assignee: SIDDHARTH TEOTIA


(1) 

allocateNew() in NullableValueVectors allocates extra memory for the validity 
vector of fixed-width vectors. Instead of doing bits.allocateNew(valueCount + 
1), we should simply do bits.allocateNew(valueCount). 

AFAIK, the only case where we need an additional valueCount is for the 
offsetVector and we already do that. Additional valueCount for the validity 
vector is not needed.

(2)

The reset() method should call reset() on the underlying value vector to 
re-initialize the state (allocation monitor, reader index, etc.) and zero out 
the buffers. Right now we only reset the validity vector.
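
(A sketch of the two fixes above with illustrative stand-in types, not the generated template code.)

class NullableFixedWidthSketch {
  interface InnerVector {
    void allocateNew(int valueCount);
    void reset();
  }

  private InnerVector bits;     // validity vector
  private InnerVector values;   // data vector

  public void allocateNew(int valueCount) {
    values.allocateNew(valueCount);
    bits.allocateNew(valueCount);   // (1) no "+ 1": only the offset vector needs the extra slot
  }

  public void reset() {
    bits.reset();
    values.reset();                 // (2) also reset the underlying value vector, not just validity
  }
}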



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1444) BitVector.splitAndTransfer copies last byte incorrectly

2017-08-31 Thread SIDDHARTH TEOTIA (JIRA)
SIDDHARTH TEOTIA created ARROW-1444:
---

 Summary: BitVector.splitAndTransfer copies last byte incorrectly 
 Key: ARROW-1444
 URL: https://issues.apache.org/jira/browse/ARROW-1444
 Project: Apache Arrow
  Issue Type: Bug
Reporter: SIDDHARTH TEOTIA
Assignee: SIDDHARTH TEOTIA


This happens when the start index falls somewhere inside a byte (i.e., it is not 
byte-aligned) and the length is not a multiple of 8.
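
(A small worked example of the corner case, using plain-Java bit arithmetic as a bit-by-bit reference rather than the BitVector implementation: with an unaligned start and a length that is not a multiple of 8, the last destination byte has to be stitched together from two source bytes, which is exactly where a byte-wise copy can go wrong.)

public class BitSplitExample {
  static boolean getBit(byte[] buf, int i) { return ((buf[i / 8] >> (i % 8)) & 1) == 1; }
  static void setBit(byte[] buf, int i)    { buf[i / 8] |= (1 << (i % 8)); }

  public static void main(String[] args) {
    byte[] src = new byte[4];
    for (int i = 0; i < 32; i += 3) setBit(src, i);    // arbitrary source pattern

    int start = 5;                                     // not byte-aligned
    int length = 13;                                   // not a multiple of 8
    byte[] dst = new byte[(length + 7) / 8];
    for (int i = 0; i < length; i++) {
      if (getBit(src, start + i)) setBit(dst, i);      // copy bit by bit
    }
    // dst bits 8..12 come from src bits 13..17, i.e. from two different source
    // bytes (src[1] and src[2]), so the last destination byte cannot be produced
    // by copying any single source byte.
    System.out.println("last dst byte = " + Integer.toBinaryString(dst[1] & 0xFF));
  }
}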



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: Arrow 0.6.0 release planning and timeline

2017-08-04 Thread Siddharth Teotia
Reviewed https://github.com/apache/arrow/pull/915 for ARROW-1296


On Fri, Aug 4, 2017 at 11:45 AM, Siddharth Teotia <siddha...@dremio.com>
wrote:

> I will review it by EOD.
>
> On Fri, Aug 4, 2017 at 11:15 AM, Li Jin <ice.xell...@gmail.com> wrote:
>
>> On the Java side I have https://issues.apache.org/jira/browse/ARROW-1296,
>> which is small bug fix.
>>
>> If someone help review it would be great. Else if it doesn't get reviewed
>> by 0.6 rc cut, we can take it off 0.6 release.
>>
>> Li
>>
>> On Fri, Aug 4, 2017 at 2:02 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> > hi all,
>> >
>> > If there are no problems with the Plasma IP Clearance, I would like to
>> > cut a release candidate for 0.6.0 at the beginning of next week. There
>> > are a handful of issues pending on the Java and C++ side that I'll be
>> > working to complete over the next several days. Please keep an eye on
>> > the release page on JIRA:
>> >
>> > https://issues.apache.org/jira/projects/ARROW/versions/12341088
>> >
>> > There are a number of outstanding Java patches; if you would like to
>> > include any of these in the 0.6.0 release, could someone review?
>> >
>> > Thanks,
>> > Wes
>> >
>> > On Tue, Aug 1, 2017 at 10:59 PM, Wes McKinney <wesmck...@gmail.com>
>> wrote:
>> > > It seems that ARROW-1282 is causing some users problems. We have the
>> > > option of making a 0.5.1 release, but given how much work has reached
>> > > master (or is about to reach master) I would be in favor of
>> > > accelerating 0.6.0, cutting a release candidate within the next couple
>> > > of days. We could aim for another release within 2-3 weeks after
>> > > completing the Plasma IP clearance.
>> > >
>> > > Thoughts?
>> > >
>> > > On Tue, Aug 1, 2017 at 9:44 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>> > >> Hello,
>> > >>
>> > >> from my side we're mostly fine for a 0.6.0 release. Currently I'm
>> facing
>> > >> a problem with https://issues.apache.org/jira/browse/ARROW-1302 in
>> the
>> > >> 0.5.0 OSX wheels. We need to fix this before 0.6.0. Also I would
>> like to
>> > >> look a bit more into the jemalloc issues that came up with 0.5.0 to
>> get
>> > >> some of them solved in the next release.
>> > >>
>> > >> Uwe
>> > >>
>> > >> On Mon, Jul 31, 2017, at 04:55 PM, Wes McKinney wrote:
>> > >>> hi all,
>> > >>>
>> > >>> We're already 40 patches into the next Arrow version. I just created
>> > >>> https://issues.apache.org/jira/browse/ARROW-1297 as a tracking
>> issue
>> > >>> so that any blocking issues can be tracked as we push forward to
>> 0.6.0
>> > >>>
>> > >>> You can track the status of the release here (accessible from the
>> > >>> "Projects" tab --> Releases in JIRA):
>> > >>>
>> > >>> https://issues.apache.org/jira/projects/ARROW/versions/12341088
>> > >>>
>> > >>> We don't have any more data types slated for integration testing for
>> > >>> this release, but it might be nice to try to finish one or more of
>> > >>> them in the next week or two:
>> > >>>
>> > >>> - Fixed size binary
>> > >>> - Fixed size lists
>> > >>> - Decimal
>> > >>> - Union
>> > >>>
>> > >>> As far as timeline for 0.6.0, I would like to push for an RC the
>> week
>> > >>> of 8/14 at latest (assuming we are ready to ship the Plasma C++
>> code),
>> > >>> reducing scope if needed. Any contributions of code, documentation,
>> or
>> > >>> JIRA prioritization would be much appreciated.
>> > >>>
>> > >>> Thanks,
>> > >>> Wes
>> >
>>
>
>


Re: Arrow 0.6.0 release planning and timeline

2017-08-04 Thread Siddharth Teotia
I will review it by EOD.

On Fri, Aug 4, 2017 at 11:15 AM, Li Jin  wrote:

> On the Java side I have https://issues.apache.org/jira/browse/ARROW-1296,
> which is small bug fix.
>
> If someone help review it would be great. Else if it doesn't get reviewed
> by 0.6 rc cut, we can take it off 0.6 release.
>
> Li
>
> On Fri, Aug 4, 2017 at 2:02 PM, Wes McKinney  wrote:
>
> > hi all,
> >
> > If there are no problems with the Plasma IP Clearance, I would like to
> > cut a release candidate for 0.6.0 at the beginning of next week. There
> > are a handful of issues pending on the Java and C++ side that I'll be
> > working to complete over the next several days. Please keep an eye on
> > the release page on JIRA:
> >
> > https://issues.apache.org/jira/projects/ARROW/versions/12341088
> >
> > There are a number of outstanding Java patches; if you would like to
> > include any of these in the 0.6.0 release, could someone review?
> >
> > Thanks,
> > Wes
> >
> > On Tue, Aug 1, 2017 at 10:59 PM, Wes McKinney 
> wrote:
> > > It seems that ARROW-1282 is causing some users problems. We have the
> > > option of making a 0.5.1 release, but given how much work has reached
> > > master (or is about to reach master) I would be in favor of
> > > accelerating 0.6.0, cutting a release candidate within the next couple
> > > of days. We could aim for another release within 2-3 weeks after
> > > completing the Plasma IP clearance.
> > >
> > > Thoughts?
> > >
> > > On Tue, Aug 1, 2017 at 9:44 AM, Uwe L. Korn  wrote:
> > >> Hello,
> > >>
> > >> from my side we're mostly fine for a 0.6.0 release. Currently I'm
> facing
> > >> a problem with https://issues.apache.org/jira/browse/ARROW-1302 in
> the
> > >> 0.5.0 OSX wheels. We need to fix this before 0.6.0. Also I would like
> to
> > >> look a bit more into the jemalloc issues that came up with 0.5.0 to
> get
> > >> some of them solved in the next release.
> > >>
> > >> Uwe
> > >>
> > >> On Mon, Jul 31, 2017, at 04:55 PM, Wes McKinney wrote:
> > >>> hi all,
> > >>>
> > >>> We're already 40 patches into the next Arrow version. I just created
> > >>> https://issues.apache.org/jira/browse/ARROW-1297 as a tracking issue
> > >>> so that any blocking issues can be tracked as we push forward to
> 0.6.0
> > >>>
> > >>> You can track the status of the release here (accessible from the
> > >>> "Projects" tab --> Releases in JIRA):
> > >>>
> > >>> https://issues.apache.org/jira/projects/ARROW/versions/12341088
> > >>>
> > >>> We don't have any more data types slated for integration testing for
> > >>> this release, but it might be nice to try to finish one or more of
> > >>> them in the next week or two:
> > >>>
> > >>> - Fixed size binary
> > >>> - Fixed size lists
> > >>> - Decimal
> > >>> - Union
> > >>>
> > >>> As far as timeline for 0.6.0, I would like to push for an RC the week
> > >>> of 8/14 at latest (assuming we are ready to ship the Plasma C++
> code),
> > >>> reducing scope if needed. Any contributions of code, documentation,
> or
> > >>> JIRA prioritization would be much appreciated.
> > >>>
> > >>> Thanks,
> > >>> Wes
> >
>


[jira] [Created] (ARROW-1310) Revert ARROW-886

2017-08-01 Thread SIDDHARTH TEOTIA (JIRA)
SIDDHARTH TEOTIA created ARROW-1310:
---

 Summary: Revert ARROW-886
 Key: ARROW-1310
 URL: https://issues.apache.org/jira/browse/ARROW-1310
 Project: Apache Arrow
  Issue Type: Bug
Reporter: SIDDHARTH TEOTIA
Assignee: SIDDHARTH TEOTIA


We don't need to reallocate the underlying offsetVector every time a
variable-length vector is reallocated.

Reallocation of the offsetVector is taken care of by the offsetVector's own
setSafe() function.

The setSafe() function of the variable-length vector decides whether to call
realloc() on its own data buffer, but it should not decide whether the
offsetVector needs reallocation. When setSafe() calls offsetVector.setSafe(),
the latter decides whether to reallocate itself.
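
Roughly, the division of responsibility should look like this (a simplified
sketch of the variable-width mutator, not the actual generated template code;
method names follow the current templates but treat the exact shape as
illustrative):

// Simplified sketch only -- not the generated template code.
public void setSafe(int index, byte[] value, int start, int length) {
  final int startOffset = offsetVector.getAccessor().get(index);
  // Grow only this vector's own data buffer when needed.
  while (data.capacity() < startOffset + length) {
    reAlloc();
  }
  // offsetVector.setSafe() decides on its own whether the offset buffer
  // must grow, so an extra offsetVector reallocation here is redundant.
  offsetVector.getMutator().setSafe(index + 1, startOffset + length);
  data.setBytes(startOffset, value, start, length);
}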



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: AppVeyor build taking longer than usual?

2017-07-31 Thread Siddharth Teotia
Thanks Wes. I didn't know that we could enable the AppVeyor build on our fork.



On Mon, Jul 31, 2017 at 10:18 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> Also, if you aren't touching any C++ or Python code (or associated CI
> scripts), you can generally ignore the Appveyor build result.
>
> On Tue, Aug 1, 2017 at 1:09 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> > hi Siddharth,
> >
> > Appveyor seems especially bad today; this is the worst I've seen it
> > backed up in a while. I'm partially responsible for triggering more
> > builds today than usual. But in general, this can happen now and then.
> > Because we are in the queue with all the other ASF projects, sometimes
> > it can take several hours for builds on the ASF account to run.
> >
> > You can generally get faster builds by enabling Appveyor on your fork;
> > on normal days Appveyor jobs will start up right away and finish in
> > under an hour, e.g. https://ci.appveyor.com/project/wesm/arrow/history
> >
> > - Wes
> >
> >
> > On Tue, Aug 1, 2017 at 1:06 AM, Siddharth Teotia <siddha...@dremio.com>
> wrote:
> >> Hi All,
> >>
> >> I created a PR 4 hours ago and the AppVeyor build still hasn't finished.
> >> The Travis CI build ran fine.
> >>
> >> Is this expected? I am surprised since for all my previous PRs, both
> checks
> >> used to get done in 1-2 hours.
> >>
> >> In fact, the latest 3 PRs (923, 924, 925) are in a similar situation.
> >>
> >> Thanks,
> >> Siddharth
>


AppVeyor build taking longer than usual?

2017-07-31 Thread Siddharth Teotia
Hi All,

I created a PR 4 hours ago and the AppVeyor build still hasn't finished.
The Travis CI build ran fine.

Is this expected? I am surprised since for all my previous PRs, both checks
used to get done in 1-2 hours.

In fact, the latest 3 PRs (923, 924, 925) are in a similar situation.

Thanks,
Siddharth


[jira] [Created] (ARROW-1300) Fix ListVector Tests

2017-07-31 Thread SIDDHARTH TEOTIA (JIRA)
SIDDHARTH TEOTIA created ARROW-1300:
---

 Summary: Fix ListVector Tests
 Key: ARROW-1300
 URL: https://issues.apache.org/jira/browse/ARROW-1300
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java - Vectors
Reporter: SIDDHARTH TEOTIA
Assignee: SIDDHARTH TEOTIA






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Re: [Java] Strange reset() method in FixedValueVectors

2017-07-28 Thread Siddharth Teotia
Hi Li

For the FixedValueVectors.java template, the initial allocation happens based
on the value of allocationSizeInBytes. For example, for a 4-byte IntVector
this is 16KB of memory (INITIAL_VALUE_ALLOCATION is 4096 values, and
4096 * 4 bytes = 16KB), which is equivalent to
INITIAL_VALUE_ALLOCATION * ${type_width} in the code. So if the user invokes
fixed_vector.allocateNew(), it will try to allocate memory based on this
value of allocationSizeInBytes.

The functions allocateNew(), allocateNewSafe(), and realloc() consume the
current value of allocationSizeInBytes and decide the actual size of memory
to allocate (or re-allocate). They also update the value of
allocationSizeInBytes after the allocation (or re-allocation) has been done
successfully.

The reset() function does allocationSizeInBytes = INITIAL_VALUE_ALLOCATION so
that subsequent calls to the alloc/realloc functions start from this base
value (intended to be the value at the time the vector was instantiated).

However, I think the reset() method should instead do
allocationSizeInBytes = INITIAL_VALUE_ALLOCATION * ${type_width} to reset to
the actual initial value.
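
Something along these lines in the FixedValueVectors.java template (a rough,
untested sketch; ${type_width} is the codegen placeholder expanded per type):

@Override
public void reset() {
  // Reset to the actual initial allocation size in bytes
  // (initial element count * element width), not just the element count.
  allocationSizeInBytes = INITIAL_VALUE_ALLOCATION * ${type_width};
  // ... the rest of the existing reset logic stays as-is ...
}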

For the NullableValueVectors.java template, these vectors delegate most calls
to the underlying bit vector and value vector (which could be fixed-width or
variable-width). For this reason, I think the reset() method should also call
values.reset() on the corresponding value vector. Right now it resets only
the bit vector.
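
Again, roughly (a sketch of the NullableValueVectors.java template change,
assuming the current field names bits and values):

@Override
public void reset() {
  bits.reset();    // already happens today: resets the validity (bit) vector
  values.reset();  // proposed: also reset the underlying value vector
  // ... any other existing bookkeeping stays as-is ...
}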

I hope this answers some of your questions.

Thanks
Siddharth

On Fri, Jul 28, 2017 at 8:55 AM, Li Jin  wrote:

> Hi All,
>
> I encountered this weirdness in the Arrow Java codebase that I hope someone
> can help me understand.
>
> This reset method of FixedValueVectors sets the allocation size to
> INITIAL_VALUE_ALLOCATION.
> I am wondering why it does that and how it handles the case where the
> vector is expanded through realloc.
>
> https://github.com/apache/arrow/blob/master/java/vector/
> src/main/codegen/templates/FixedValueVectors.java#L165
>
> For comparison, reset() in NullableValueVectors doesn't do that:
>
> https://github.com/apache/arrow/blob/master/java/vector/
> src/main/codegen/templates/NullableValueVectors.java#L285
>
> Appreciate the help!
>
> Li
>


[jira] [Created] (ARROW-1267) Handle zero length case in BitVector.splitAndTransfer

2017-07-25 Thread SIDDHARTH TEOTIA (JIRA)
SIDDHARTH TEOTIA created ARROW-1267:
---

 Summary: Handle zero length case in BitVector.splitAndTransfer
 Key: ARROW-1267
 URL: https://issues.apache.org/jira/browse/ARROW-1267
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java - Vectors
Reporter: SIDDHARTH TEOTIA
Assignee: SIDDHARTH TEOTIA
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1249) Expose the fillEmpties function from NullableVector.mutator

2017-07-21 Thread SIDDHARTH TEOTIA (JIRA)
SIDDHARTH TEOTIA created ARROW-1249:
---

 Summary: Expose the fillEmpties function from 
NullableVector.mutator
 Key: ARROW-1249
 URL: https://issues.apache.org/jira/browse/ARROW-1249
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java - Vectors
Reporter: SIDDHARTH TEOTIA
Assignee: SIDDHARTH TEOTIA
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (ARROW-1237) Expose the ability to set lastSet

2017-07-19 Thread SIDDHARTH TEOTIA (JIRA)
SIDDHARTH TEOTIA created ARROW-1237:
---

 Summary: Expose the ability to set lastSet 
 Key: ARROW-1237
 URL: https://issues.apache.org/jira/browse/ARROW-1237
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java - Vectors
Reporter: SIDDHARTH TEOTIA
Assignee: SIDDHARTH TEOTIA
Priority: Minor


Expose the ability to set lastSet on vectors so that Mutator.setValueCount()
doesn't blow away the vector's existing contents if we have previously loaded
it.
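
A rough usage sketch of what exposing this would enable (setLastSet() is the
proposed method, so its exact name and signature may differ; allocator and
loadedRowCount are placeholder context):

try (NullableVarCharVector vector = new NullableVarCharVector("v", allocator)) {
  // ... buffers loaded into the vector externally, e.g. via VectorLoader ...
  int n = loadedRowCount;
  vector.getMutator().setLastSet(n - 1); // mark how far the loaded data extends
  vector.getMutator().setValueCount(n);  // no longer wipes the loaded contents
}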



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

