[jira] [Created] (ARROW-5519) Add ORC JNI related components to travis CI
Yurui Zhou created ARROW-5519:

Summary: Add ORC JNI related components to travis CI
Key: ARROW-5519
URL: https://issues.apache.org/jira/browse/ARROW-5519
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Java
Reporter: Yurui Zhou

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5515) Ensure JVM to have sufficient capacity for large number of local reference
Yurui Zhou created ARROW-5515:

Summary: Ensure JVM to have sufficient capacity for large number of local reference
Key: ARROW-5515
URL: https://issues.apache.org/jira/browse/ARROW-5515
Project: Apache Arrow
Issue Type: Sub-task
Components: C++, Java
Reporter: Yurui Zhou
Assignee: Yurui Zhou
Re: ARROW-4714: Providing JNI interface to Read ORC file via Arrow C++
Hey guys:

All the comments have now been resolved, and all the builds and tests pass. Are there any other general comments regarding these changes?

Yurui

On 21 May 2019, 10:36 AM +0800, Yurui Zhou wrote:
> Hi Micah:
>
> Thanks for the response. According to our benchmark, cpp-orc is on
> average 1% to 10% slower than java-orc, while the on-heap to off-heap
> memory conversion overhead can easily outweigh such a performance
> difference.
> We are also currently working on some performance improvement patches to
> cpp-orc to make sure it achieves at least the same performance as java-orc.
>
> Thanks
> Yurui
>
> On 20 May 2019, 9:22 PM +0800, Micah Kornfield wrote:
> > Hi Yurui,
> > This is cool, I will try to leave some comments tonight.
> >
> > Reading the JIRA, it references the conversion from on-heap to off-heap
> > memory being the performance issue. Now that Arrow Java can point at
> > arbitrary memory, do you know the performance delta between java-orc and
> > cpp-orc? (I'm wondering if we should do something similar for parquet-cpp)
> >
> > Thanks,
> > Micah
> >
> > On Monday, May 20, 2019, Yurui Zhou wrote:
> >
> > > Hi Guys:
> > >
> > > I just created a PR with WIP changes that add a JNI interface for
> > > reading ORC files.
> > >
> > > All the major changes have been done and I would like some early
> > > feedback from the community.
> > >
> > > Feel free to take a look and leave your feedback.
> > > https://github.com/apache/arrow/pull/4348
> > >
> > > Some cleanup and unit tests will be added in follow-up iterations.
> > >
> > > Thanks
> > > Yurui
Re: ARROW-4714: Providing JNI interface to Read ORC file via Arrow C++
Hi Micah:

Thanks for the response. According to our benchmark, cpp-orc is on average 1% to 10% slower than java-orc, while the on-heap to off-heap memory conversion overhead can easily outweigh such a performance difference.
We are also currently working on some performance improvement patches to cpp-orc to make sure it achieves at least the same performance as java-orc.

Thanks
Yurui

On 20 May 2019, 9:22 PM +0800, Micah Kornfield wrote:
> Hi Yurui,
> This is cool, I will try to leave some comments tonight.
>
> Reading the JIRA, it references the conversion from on-heap to off-heap
> memory being the performance issue. Now that Arrow Java can point at
> arbitrary memory, do you know the performance delta between java-orc and
> cpp-orc? (I'm wondering if we should do something similar for parquet-cpp)
>
> Thanks,
> Micah
>
> On Monday, May 20, 2019, Yurui Zhou wrote:
>
> > Hi Guys:
> >
> > I just created a PR with WIP changes that add a JNI interface for
> > reading ORC files.
> >
> > All the major changes have been done and I would like some early
> > feedback from the community.
> >
> > Feel free to take a look and leave your feedback.
> > https://github.com/apache/arrow/pull/4348
> >
> > Some cleanup and unit tests will be added in follow-up iterations.
> >
> > Thanks
> > Yurui
ARROW-4714: Providing JNI interface to Read ORC file via Arrow C++
Hi Guys:

I just created a PR with WIP changes that add a JNI interface for reading ORC files.

All the major changes have been done and I would like some early feedback from the community.

Feel free to take a look and leave your feedback.
https://github.com/apache/arrow/pull/4348

Some cleanup and unit tests will be added in follow-up iterations.

Thanks
Yurui
Re: Proper way to retrigger Travis CI builds
Great to know! Thank you!

Yurui

On 25 Apr 2019, 8:33 PM +0800, Neville Dipale wrote:
> To add here, sometimes builds for unrelated changes are caused by your
> branch being behind master. I've noticed that whenever I rebase my changes
> to latest master, I reliably only trigger the Rust jobs to run.
> Maybe that could also help non-Arrow committers :)
>
> On Thu, 25 Apr 2019 at 14:11, Wes McKinney wrote:
>
> > If you are an Arrow committer you can restart builds in the Travis CI
> > UI, but otherwise the method that Antoine indicated is the best option
> > for non-committers
> >
> > On Thu, Apr 25, 2019 at 4:51 AM Antoine Pitrou wrote:
> > >
> > > Hi,
> > >
> > > I often do a force-push of identical contents, with a different
> > > changeset id:
> > >
> > > $ git commit -a --amend && git push --force
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > > On 25/04/2019 at 11:39, Yurui Zhou wrote:
> > > > Hey guys:
> > > >
> > > > When submitting a PR to master, I often run into Travis CI build
> > > > failures that are unrelated to my changes. I usually close and
> > > > reopen the PR to re-trigger the build. Just wondering, is there any
> > > > other way (like a button) that allows me to re-trigger the failing
> > > > builds without closing and reopening my PR?
> > > >
> > > > Thanks
> > > > Yurui
Proper way to retrigger Travis CI builds
Hey guys:

When submitting a PR to master, I often run into Travis CI build failures that are unrelated to my changes. I usually close and reopen the PR to re-trigger the build. Just wondering, is there any other way (like a button) that allows me to re-trigger the failing builds without closing and reopening my PR?

Thanks
Yurui
[jira] [Created] (ARROW-5199) [Java] Add unsafe access method to ArrowBuf
Yurui Zhou created ARROW-5199:

Summary: [Java] Add unsafe access method to ArrowBuf
Key: ARROW-5199
URL: https://issues.apache.org/jira/browse/ARROW-5199
Project: Apache Arrow
Issue Type: Sub-task
Reporter: Yurui Zhou

Add an unsafe access method to ArrowBuf so that external users can access the underlying memory without boundary checks.
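To make the request concrete, here is a minimal sketch of what a checked and an unchecked accessor could look like side by side. `ToyBuf`, `getInt`, and `getIntUnchecked` are hypothetical names for illustration only and are not ArrowBuf's API. Note also that this pure-Java version sits on a `ByteBuffer`, which still checks bounds internally; a real off-heap implementation would presumably read through a raw memory address instead, so only the wrapper's own check is skipped here.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical wrapper for illustration; NOT Arrow's ArrowBuf.
final class ToyBuf {
    private final ByteBuffer memory; // direct (off-heap) storage
    private final int capacity;

    ToyBuf(int capacity) {
        this.capacity = capacity;
        this.memory = ByteBuffer.allocateDirect(capacity).order(ByteOrder.LITTLE_ENDIAN);
    }

    // Checked write: validates the byte range before storing.
    void setInt(int index, int value) {
        checkIndex(index, Integer.BYTES);
        memory.putInt(index, value);
    }

    // Checked read: pays for an explicit range check on every call.
    int getInt(int index) {
        checkIndex(index, Integer.BYTES);
        return memory.getInt(index);
    }

    // "Unsafe"-style read: skips the wrapper's range check and trusts the caller.
    // (ByteBuffer itself still checks bounds in this sketch.)
    int getIntUnchecked(int index) {
        return memory.getInt(index);
    }

    private void checkIndex(int index, int length) {
        if (index < 0 || index > capacity - length) {
            throw new IndexOutOfBoundsException(
                "index " + index + ", length " + length + ", capacity " + capacity);
        }
    }

    public static void main(String[] args) {
        ToyBuf buf = new ToyBuf(16);
        buf.setInt(4, 42);
        System.out.println(buf.getInt(4));          // prints 42
        System.out.println(buf.getIntUnchecked(4)); // prints 42
    }
}
```

The unchecked accessor only makes sense for callers that have already validated the range once, for example before a tight loop over a whole column.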
[jira] [Created] (ARROW-5198) [Java] Add hasNull flag to Vectors
Yurui Zhou created ARROW-5198:

Summary: [Java] Add hasNull flag to Vectors
Key: ARROW-5198
URL: https://issues.apache.org/jira/browse/ARROW-5198
Project: Apache Arrow
Issue Type: Sub-task
Components: Java
Reporter: Yurui Zhou
Assignee: Yurui Zhou

Add a hasNull flag to Arrow vectors so that the null check can be skipped for vectors that contain no nulls.
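As a rough illustration of the idea, the sketch below keeps a validity bitmap plus a single boolean that records whether any slot is null, and uses it to pick a loop without per-element null checks. `ToyIntVector` and its methods are hypothetical and do not reflect the actual Arrow Java vector classes.

```java
// Hypothetical toy vector for illustration; NOT an Arrow Java class.
final class ToyIntVector {
    private final int[] values;
    private final byte[] validity; // one bit per slot, 1 = value present
    private final boolean hasNull; // true if at least one slot is null

    ToyIntVector(Integer[] data) {
        values = new int[data.length];
        validity = new byte[(data.length + 7) / 8];
        boolean sawNull = false;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == null) {
                sawNull = true;
            } else {
                values[i] = data[i];
                validity[i >> 3] |= (byte) (1 << (i & 7));
            }
        }
        hasNull = sawNull;
    }

    boolean hasNull() {
        return hasNull;
    }

    boolean isNull(int i) {
        return (validity[i >> 3] & (1 << (i & 7))) == 0;
    }

    // Sums all non-null values; the validity bitmap is consulted only
    // when the flag says a null may actually be present.
    long sumNonNull() {
        long sum = 0;
        if (!hasNull) {
            for (int v : values) {
                sum += v; // fast path: no per-element null check
            }
        } else {
            for (int i = 0; i < values.length; i++) {
                if (!isNull(i)) {
                    sum += values[i];
                }
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(new ToyIntVector(new Integer[] {1, 2, 3}).sumNonNull());    // prints 6
        System.out.println(new ToyIntVector(new Integer[] {1, null, 3}).sumNonNull()); // prints 4
    }
}
```

In a mutable vector the flag would have to be maintained on every write, which is an extra cost this sketch avoids by computing it once at construction.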
[jira] [Created] (ARROW-5197) [Java] Improving Arrow Vector Reading performance
Yurui Zhou created ARROW-5197:

Summary: [Java] Improving Arrow Vector Reading performance
Key: ARROW-5197
URL: https://issues.apache.org/jira/browse/ARROW-5197
Project: Apache Arrow
Issue Type: Improvement
Components: Java
Reporter: Yurui Zhou
Assignee: Yurui Zhou

Currently the read interface of the Java Arrow Vector is quite slow because each access has to go through a validity bit check and a boundary check before it can actually load the data.

Arrow Vector and ArrowBuf should expose unsafe methods that let advanced users access the underlying data directly, without null checks and boundary checks.
[jira] [Created] (ARROW-4772) Provide new ORC adapter interface that allow user to specify row number
Yurui Zhou created ARROW-4772:

Summary: Provide new ORC adapter interface that allow user to specify row number
Key: ARROW-4772
URL: https://issues.apache.org/jira/browse/ARROW-4772
Project: Apache Arrow
Issue Type: Sub-task
Components: C++
Reporter: Yurui Zhou
[jira] [Created] (ARROW-4773) Enable copy free conversion for dictionary encoded string column
Yurui Zhou created ARROW-4773:

Summary: Enable copy free conversion for dictionary encoded string column
Key: ARROW-4773
URL: https://issues.apache.org/jira/browse/ARROW-4773
Project: Apache Arrow
Issue Type: Sub-task
Components: C++
Reporter: Yurui Zhou
[jira] [Created] (ARROW-4771) Enable copy free conversion for Composite type
Yurui Zhou created ARROW-4771:

Summary: Enable copy free conversion for Composite type
Key: ARROW-4771
URL: https://issues.apache.org/jira/browse/ARROW-4771
Project: Apache Arrow
Issue Type: Sub-task
Components: C++
Reporter: Yurui Zhou
[jira] [Created] (ARROW-4770) Enable copy free conversion for primitive types
Yurui Zhou created ARROW-4770:

Summary: Enable copy free conversion for primitive types
Key: ARROW-4770
URL: https://issues.apache.org/jira/browse/ARROW-4770
Project: Apache Arrow
Issue Type: Sub-task
Components: C++
Reporter: Yurui Zhou
[jira] [Created] (ARROW-4714) Providing JNI interface to Read ORC file via Arrow C++
Yurui Zhou created ARROW-4714:

Summary: Providing JNI interface to Read ORC file via Arrow C++
Key: ARROW-4714
URL: https://issues.apache.org/jira/browse/ARROW-4714
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Java
Reporter: Yurui Zhou

Currently, if we want to read ORC data into an Arrow RecordBatch in the Java runtime, we need to first use the ORC Java reader to load the data into memory and then convert it to an Arrow RecordBatch.

However, since the ORC Java reader only reads ORC data into on-heap memory, while Arrow on Java only supports off-heap memory, a memory copy is unavoidable in the conversion process.

In our internal benchmark, the conversion can take up to 25% of end-to-end latency when running TPC-H Q1.

To work around this overhead, one approach is to let the Java runtime read data directly from the native ORC C++ reader, which loads data directly into off-heap memory, so that only pointer manipulation and schema serialization/deserialization are involved in the conversion process.
[jira] [Created] (ARROW-4713) Improve C++ Orc Adapter performance and memory footprint
Yurui Zhou created ARROW-4713:

Summary: Improve C++ Orc Adapter performance and memory footprint
Key: ARROW-4713
URL: https://issues.apache.org/jira/browse/ARROW-4713
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Yurui Zhou

Currently Arrow C++ provides a naive adapter implementation that allows users to read an ORC file into an Arrow RecordBatch. However, this implementation has several drawbacks:

* Inefficient conversion that incurs huge memcpy overhead
** Currently the ORC adapter performs a byte-by-byte memcpy to move data from ORC VectorBatch to Arrow RecordBatch, regardless of the fact that ORC VectorBatch shares the same memory layout with Arrow for most data types.
* Huge memory footprint because of the lack of a TableReader implementation
** The ORC adapter currently only allows users to read data in units of a stripe. However, because ORC is a columnar format with a high compression ratio, the data read from a single ORC stripe can potentially take up gigabytes of memory, which makes the ORC adapter not quite usable in production environments.

Here we propose a new ORC adapter implementation to fix these issues.
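The memory-footprint point can be illustrated with a small sketch: instead of materializing a whole stripe at once, a reader could hand out row ranges of a caller-chosen size, so peak memory is bounded by the batch size rather than by the stripe size. This is only one possible reading of the proposal; `BatchPlanner` and `plan` are hypothetical names, and the sketch deals with row counts only, not with actual ORC decoding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the Arrow or ORC APIs.
final class BatchPlanner {
    // Splits a stripe of stripeRows rows into {startRow, rowCount} ranges,
    // each holding at most batchRows rows.
    static List<long[]> plan(long stripeRows, int batchRows) {
        if (stripeRows < 0 || batchRows <= 0) {
            throw new IllegalArgumentException("stripeRows must be >= 0 and batchRows > 0");
        }
        List<long[]> ranges = new ArrayList<>();
        for (long start = 0; start < stripeRows; start += batchRows) {
            long count = Math.min(batchRows, stripeRows - start);
            ranges.add(new long[] {start, count});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // A 10-row stripe read 4 rows at a time yields ranges of 4, 4, and 2 rows.
        for (long[] range : plan(10, 4)) {
            System.out.println("start=" + range[0] + " rows=" + range[1]);
        }
    }
}
```

With a stripe-at-a-time interface the caller has no such knob, which is what makes very large stripes a problem.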