[jira] [Created] (ARROW-5519) Add ORC JNI related components to travis CI

2019-06-06 Thread Yurui Zhou (JIRA)
Yurui Zhou created ARROW-5519:
-

 Summary: Add ORC JNI related components to travis CI
 Key: ARROW-5519
 URL: https://issues.apache.org/jira/browse/ARROW-5519
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Java
Reporter: Yurui Zhou






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5515) Ensure JVM to have sufficient capacity for large number of local reference

2019-06-05 Thread Yurui Zhou (JIRA)
Yurui Zhou created ARROW-5515:
-

 Summary: Ensure JVM to have sufficient capacity for large number 
of local reference
 Key: ARROW-5515
 URL: https://issues.apache.org/jira/browse/ARROW-5515
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++, Java
Reporter: Yurui Zhou
Assignee: Yurui Zhou








Re: ARROW-4714: Providing JNI interface to Read ORC file via Arrow C++

2019-05-29 Thread Yurui Zhou
Hey guys:

All the review comments have now been resolved, and all the builds and tests 
pass.

Are there any other general comments on these changes?

Yurui
On 21 May 2019, 10:36 AM +0800, Yurui Zhou wrote:
> Hi Micah:
>
> Thanks for the response. According to our benchmark, cpp-orc is on average 
> 1% to 10% slower than java-orc, but the on-heap to off-heap memory conversion 
> overhead can easily outweigh that difference.
> We are also working on some performance improvement patches to cpp-orc to 
> make sure it achieves at least the same performance as java-orc.
>
> Thanks
> Yurui
> On 20 May 2019, 9:22 PM +0800, Micah Kornfield wrote:
> > Hi Yurui,
> > This is cool, I will try to leave some comments tonight.
> >
> > Reading the JIRA it references the conversion from on-heap to off heap
> > memory being the performance issue. Now that Arrow Java can point at
> > arbitrary memory do you know the performance delta between java-orc and
> > cpp-orc? (I'm wondering if we should do something similar for parquet-cpp)
> >
> > Thanks,
> > Micah
> >
> > On Monday, May 20, 2019, Yurui Zhou  wrote:
> >
> > > Hi Guys:
> > >
> > > I just created a PR with WIP changes that add a JNI interface for
> > > reading ORC files.
> > >
> > > All the major changes have been done and I would like some early feedback
> > > from the community.
> > >
> > > Feel free to take a look and leave your feedback.
> > > https://github.com/apache/arrow/pull/4348
> > >
> > > Some cleanup and unit tests will be added in follow-up iterations.
> > >
> > > Thanks
> > > Yurui
> > >
> > >


Re: ARROW-4714: Providing JNI interface to Read ORC file via Arrow C++

2019-05-20 Thread Yurui Zhou
Hi Micah:

Thanks for the response. According to our benchmark, cpp-orc is on average 
1% to 10% slower than java-orc, but the on-heap to off-heap memory conversion 
overhead can easily outweigh that difference.
We are also working on some performance improvement patches to cpp-orc to 
make sure it achieves at least the same performance as java-orc.

Thanks
Yurui
On 20 May 2019, 9:22 PM +0800, Micah Kornfield wrote:
> Hi Yurui,
> This is cool, I will try to leave some comments tonight.
>
> Reading the JIRA it references the conversion from on-heap to off heap
> memory being the performance issue. Now that Arrow Java can point at
> arbitrary memory do you know the performance delta between java-orc and
> cpp-orc? (I'm wondering if we should do something similar for parquet-cpp)
>
> Thanks,
> Micah
>
> On Monday, May 20, 2019, Yurui Zhou  wrote:
>
> > Hi Guys:
> >
> > I just created a PR with WIP changes that add a JNI interface for
> > reading ORC files.
> >
> > All the major changes have been done and I would like some early feedback
> > from the community.
> >
> > Feel free to take a look and leave your feedback.
> > https://github.com/apache/arrow/pull/4348
> >
> > Some cleanup and unit tests will be added in follow-up iterations.
> >
> > Thanks
> > Yurui
> >
> >


ARROW-4714: Providing JNI interface to Read ORC file via Arrow C++

2019-05-20 Thread Yurui Zhou
Hi Guys:

I just created a PR with WIP changes that add a JNI interface for reading ORC 
files.

All the major changes have been done and I would like some early feedback from 
the community.

Feel free to take a look and leave your feedback.
https://github.com/apache/arrow/pull/4348

Some cleanup and unit tests will be added in follow-up iterations.

Thanks
Yurui



Re: Proper way to retrigger Travis CI builds

2019-04-25 Thread Yurui Zhou
Great to know! Thank you!

Yurui
On 25 Apr 2019, 8:33 PM +0800, Neville Dipale wrote:
> To add here, sometimes builds for unrelated changes are caused by your
> branch being behind master. I've noticed that whenever I rebase my changes
> to latest master, I reliably only trigger the Rust jobs to run.
> Maybe that could also help non-Arrow committers :)
>
> On Thu, 25 Apr 2019 at 14:11, Wes McKinney  wrote:
>
> > If you are an Arrow committer you can restart builds in the Travis CI
> > UI, but otherwise the method that Antoine indicated is the best option
> > for non-committers
> >
> > On Thu, Apr 25, 2019 at 4:51 AM Antoine Pitrou  wrote:
> > >
> > >
> > > Hi,
> > >
> > > I often do a force-push of identical contents, with a different
> > > changeset id:
> > >
> > > $ git commit -a --amend && git push --force
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > Le 25/04/2019 à 11:39, Yurui Zhou a écrit :
> > > > Hey guys:
> > > >
> > > > When submitting PRs to master, I often run into Travis CI build
> > > > failures that are unrelated to my changes. I usually close and reopen
> > > > the PR to re-trigger the build. Is there any other way (like a button)
> > > > that allows me to re-trigger the failing builds without closing and
> > > > reopening my PR?
> > > >
> > > > Thanks
> > > > Yurui
> > > >
> >


Proper way to retrigger Travis CI builds

2019-04-25 Thread Yurui Zhou
Hey guys:

When submitting PRs to master, I often run into Travis CI build failures that 
are unrelated to my changes. I usually close and reopen the PR to re-trigger 
the build. Is there any other way (like a button) that allows me to re-trigger 
the failing builds without closing and reopening my PR?

Thanks
Yurui


[jira] [Created] (ARROW-5199) [Java] Add unsafe access method to ArrowBuf

2019-04-22 Thread Yurui Zhou (JIRA)
Yurui Zhou created ARROW-5199:
-

 Summary: [Java] Add unsafe access method to ArrowBuf
 Key: ARROW-5199
 URL: https://issues.apache.org/jira/browse/ARROW-5199
 Project: Apache Arrow
  Issue Type: Sub-task
Reporter: Yurui Zhou


Add an unsafe access method to ArrowBuf to allow external users to access the 
underlying memory without boundary checks.





[jira] [Created] (ARROW-5198) [Java] Add hasNull flag to Vectors

2019-04-22 Thread Yurui Zhou (JIRA)
Yurui Zhou created ARROW-5198:
-

 Summary: [Java] Add hasNull flag to Vectors
 Key: ARROW-5198
 URL: https://issues.apache.org/jira/browse/ARROW-5198
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Java
Reporter: Yurui Zhou
Assignee: Yurui Zhou


Add a hasNull flag to Arrow vectors so that, for vectors without any nulls, the 
null check can be skipped.
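
A minimal sketch of the idea, assuming a simplified int vector backed by an array and a validity bitmap. The class SketchIntVector and its methods are hypothetical illustrations, not the Arrow Java API:

```java
import java.util.BitSet;

// Hypothetical, simplified vector used only to illustrate the flag.
final class SketchIntVector {
    private final int[] values;
    private final BitSet validity;  // bit i set => values[i] is non-null
    private final boolean hasNull;  // computed once, when the vector is built

    SketchIntVector(int[] values, BitSet validity) {
        this.values = values;
        this.validity = validity;
        // The vector has a null if any of the first values.length bits is clear.
        this.hasNull = validity.nextClearBit(0) < values.length;
    }

    boolean hasNull() {
        return hasNull;
    }

    // Sums the non-null values; the per-element validity check is skipped
    // entirely when the vector is known to contain no nulls.
    long sum() {
        long total = 0;
        if (!hasNull) {
            for (int v : values) {
                total += v;
            }
            return total;
        }
        for (int i = 0; i < values.length; i++) {
            if (validity.get(i)) {
                total += values[i];
            }
        }
        return total;
    }
}
```

The flag is computed once at construction here; a mutable vector would need to keep it up to date on every write, which is a design choice this sketch does not cover.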





[jira] [Created] (ARROW-5197) [Java] Improving Arrow Vector Reading performance

2019-04-22 Thread Yurui Zhou (JIRA)
Yurui Zhou created ARROW-5197:
-

 Summary: [Java] Improving Arrow Vector Reading performance
 Key: ARROW-5197
 URL: https://issues.apache.org/jira/browse/ARROW-5197
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Yurui Zhou
Assignee: Yurui Zhou


Currently the read interface of the Java Arrow Vector is quite slow because 
each access has to go through a validity bit check and a boundary check before 
it can actually load the data. 

The Arrow Vector and ArrowBuf should expose unsafe methods that let advanced 
users access the underlying data directly, without null and boundary checks.
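
The split between a checked and an unchecked read path might look roughly like the sketch below. The class CheckedVsUncheckedVector and both method names are hypothetical and are not the Arrow Java API. The sketch only shows the shape of the interface: it is backed by a Java array, which the JVM still bounds-checks, whereas a real unsafe path would read off-heap memory directly.

```java
import java.util.BitSet;

// Hypothetical illustration of a checked and an unchecked accessor.
final class CheckedVsUncheckedVector {
    private final int[] data;
    private final BitSet validity;  // bit i set => data[i] is non-null
    private final int valueCount;

    CheckedVsUncheckedVector(int[] data, BitSet validity, int valueCount) {
        this.data = data;
        this.validity = validity;
        this.valueCount = valueCount;
    }

    // Safe path: explicit boundary check, then validity check, then the load.
    int get(int index) {
        if (index < 0 || index >= valueCount) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        if (!validity.get(index)) {
            throw new IllegalStateException("value at " + index + " is null");
        }
        return data[index];
    }

    // "Unsafe" path: the caller promises the index is in range and non-null,
    // so both explicit checks are skipped.
    int getUnchecked(int index) {
        return data[index];
    }
}
```

A caller that has already validated a whole range (for example, via a no-nulls flag and a known value count) could then use the unchecked accessor inside its inner loop.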





[jira] [Created] (ARROW-4772) Provide new ORC adapter interface that allow user to specify row number

2019-03-04 Thread Yurui Zhou (JIRA)
Yurui Zhou created ARROW-4772:
-

 Summary: Provide new ORC adapter interface that allow user to 
specify row number
 Key: ARROW-4772
 URL: https://issues.apache.org/jira/browse/ARROW-4772
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Yurui Zhou








[jira] [Created] (ARROW-4773) Enable copy free conversion for dictionary encoded string column

2019-03-04 Thread Yurui Zhou (JIRA)
Yurui Zhou created ARROW-4773:
-

 Summary: Enable copy free conversion for dictionary encoded string 
column
 Key: ARROW-4773
 URL: https://issues.apache.org/jira/browse/ARROW-4773
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Yurui Zhou








[jira] [Created] (ARROW-4771) Enable copy free conversion for Composite type

2019-03-04 Thread Yurui Zhou (JIRA)
Yurui Zhou created ARROW-4771:
-

 Summary: Enable copy free conversion for Composite type
 Key: ARROW-4771
 URL: https://issues.apache.org/jira/browse/ARROW-4771
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Yurui Zhou








[jira] [Created] (ARROW-4770) Enable copy free conversion for primitive types

2019-03-04 Thread Yurui Zhou (JIRA)
Yurui Zhou created ARROW-4770:
-

 Summary: Enable copy free conversion for primitive types
 Key: ARROW-4770
 URL: https://issues.apache.org/jira/browse/ARROW-4770
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: C++
Reporter: Yurui Zhou








[jira] [Created] (ARROW-4714) Providing JNI interface to Read ORC file via Arrow C++

2019-02-28 Thread Yurui Zhou (JIRA)
Yurui Zhou created ARROW-4714:
-

 Summary: Providing JNI interface to Read ORC file via Arrow C++
 Key: ARROW-4714
 URL: https://issues.apache.org/jira/browse/ARROW-4714
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Java
Reporter: Yurui Zhou


Currently, if we want to read ORC data into an Arrow RecordBatch in the Java 
runtime, we need to first use the ORC Java reader to load the data into memory 
and then convert it to an Arrow RecordBatch.

However, since the ORC Java reader only reads ORC data into on-heap memory, 
while Arrow RecordBatch only supports off-heap memory in Java, a memory copy is 
unavoidable in the conversion process. In our internal benchmark, the 
conversion can take up to 25% of end-to-end latency when running TPC-H Q1.

To work around this overhead, one approach is to let the Java runtime read data 
directly from the native ORC C++ reader, which loads data straight into 
off-heap memory, so that only pointer manipulation and schema ser/de are 
involved in the conversion process. 
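
For illustration, the copy that the current Java path cannot avoid can be sketched as follows. The class and method names are hypothetical; the sketch only uses java.nio and does not show the ORC or Arrow APIs. With the JNI approach, the native reader would allocate the off-heap buffer itself and hand its address to Java, so this step would disappear.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper showing the on-heap to off-heap copy step.
final class HeapToOffHeapCopy {
    // Copies a column of longs decoded on the Java heap into a newly
    // allocated direct (off-heap) buffer. Every value is copied once.
    static ByteBuffer copyToOffHeap(long[] onHeap) {
        ByteBuffer direct = ByteBuffer
                .allocateDirect(onHeap.length * Long.BYTES)
                .order(ByteOrder.nativeOrder());
        // The long view shares the direct buffer's memory and byte order.
        direct.asLongBuffer().put(onHeap);
        return direct;
    }
}
```

The cost of this copy grows linearly with the amount of data read, which is consistent with it showing up as a sizeable share of query latency.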





[jira] [Created] (ARROW-4713) Improve C++ Orc Adapter performance and memory footprint

2019-02-28 Thread Yurui Zhou (JIRA)
Yurui Zhou created ARROW-4713:
-

 Summary: Improve C++ Orc Adapter performance and memory footprint
 Key: ARROW-4713
 URL: https://issues.apache.org/jira/browse/ARROW-4713
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Yurui Zhou


Currently Arrow C++ provides a naive adapter implementation that allows users 
to read ORC files into Arrow RecordBatches. However, this implementation has 
several drawbacks:
 * Inefficient conversion that incurs huge memcpy overhead
 ** Currently the ORC adapter performs a byte-by-byte memcpy to move data 
from ORC VectorBatch to Arrow RecordBatch, even though ORC VectorBatch shares 
the same memory layout as Arrow for most data types.
 * Huge memory footprint because of the lack of a TableReader implementation
 ** The ORC adapter currently only allows users to read data in units of 
stripes. However, since ORC is a columnar format with a high compression ratio, 
data read from a single ORC stripe can potentially take up gigabytes of memory, 
which makes the ORC adapter hard to use in production environments.

Here we propose a new ORC adapter implementation to fix these issues. 
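
As a rough illustration of reading in bounded row batches rather than whole stripes, the sketch below splits one stripe's rows into ranges of at most a caller-chosen size. RowBatchPlanner and its plan method are hypothetical and are not part of the ORC adapter; the sketch only shows how a row limit bounds the amount of data materialized at once.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical planner that cuts a stripe into fixed-size row ranges.
final class RowBatchPlanner {
    // Returns {startRow, rowCount} pairs covering [0, stripeRows), with
    // each range holding at most batchSize rows.
    static List<long[]> plan(long stripeRows, long batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        for (long start = 0; start < stripeRows; start += batchSize) {
            long count = Math.min(batchSize, stripeRows - start);
            ranges.add(new long[] {start, count});
        }
        return ranges;
    }
}
```

With ranges like these, a reader would only need to hold one batch of decoded rows in memory at a time instead of the full stripe.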


