Re: C++ StringColumnStatisticsImpl::update Performance

2019-12-04 Thread Xiening Dai
> _stats.setHasNull(hasNull); > } > > -std::string getMinimum() const override { > +const std::string & getMinimum() const override { > if(hasMinimum()){ > return _stats.getMinimum(); > }else{ > @@ -1085,7 +1085,7 @@ namespace orc { >

Re: C++ StringColumnStatisticsImpl::update Performance

2019-12-03 Thread Xiening Dai
Thanks for sharing your findings. I think we should change the InternalStatisticsImpl get methods to return const reference instead. Such as - const T & getMaximum() const { return _maximum; } const T & getMinimum() const { return _minimum; } This way there won’t be any new string objec
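
The suggestion above can be illustrated with a minimal sketch. `StatsSketch` is a hypothetical stand-in written for this example, not the actual `InternalStatisticsImpl` class; only the two `const T &` getter signatures come from the message.

```cpp
#include <string>

// Hypothetical stand-in for a statistics holder (not the real ORC class).
template <typename T>
class StatsSketch {
 public:
  void setMinimum(const T& v) { _minimum = v; }
  void setMaximum(const T& v) { _maximum = v; }

  // Returning by value copies the stored object on every call.
  T getMinimumByValue() const { return _minimum; }

  // Returning a const reference exposes the stored object without a copy.
  const T& getMinimum() const { return _minimum; }
  const T& getMaximum() const { return _maximum; }

 private:
  T _minimum{};
  T _maximum{};
};
```

With `T = std::string`, the by-value getter constructs a new string per call, while the reference getters do not. The trade-off is that a returned reference is only valid while the owning object is alive.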

The Orc magic string

2019-06-14 Thread Xiening Dai
Hi all, In the Orc append scenario, the append operation (including writing the additional data and the new footer) needs to be atomic. Otherwise, if it fails partway, the file tail would be unrecognizable. Unfortunately, not all file systems can guarantee atomic writes. When failure does happen,

Re: C++ API seekToRow() performance.

2019-05-31 Thread Xiening Dai
Hi Shankar, This is a known issue. As far as I know, there are two issues here - 1. The reader doesn’t use the row group index to skip unnecessary rows. Instead, it reads through every row until the cursor moves to the desired position. [1] 2. We could have skipped the entire compression block when curre
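
Point 1 can be sketched as a small calculation, assuming a fixed number of rows per row group. `SeekPlan` and `planSeek` are hypothetical names for this example and are not part of the ORC C++ API; the sketch only shows how an index lookup could replace most of the row-by-row reading.

```cpp
#include <cstdint>

// Hypothetical result type: which row group to jump to, and how many rows
// still have to be read and discarded after the jump.
struct SeekPlan {
  uint64_t rowGroup;
  uint64_t rowsToSkip;
};

// rowInStripe: target row, counted from the start of the stripe.
// rowIndexStride: rows per row group; assumed to be nonzero.
inline SeekPlan planSeek(uint64_t rowInStripe, uint64_t rowIndexStride) {
  return SeekPlan{rowInStripe / rowIndexStride, rowInStripe % rowIndexStride};
}
```

For example, with a stride of 10000 rows, seeking to row 25000 means jumping to row group 2 and then skipping 5000 rows, instead of reading 25000 rows.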

Re: Review of the column encryption format changes

2019-04-25 Thread Xiening Dai
I plan to take a look this week. Thanks. > On Apr 23, 2019, at 10:08 AM, Owen O'Malley wrote: > > All, > Please take a look at the format changes for column encryption. > > https://github.com/apache/orc/pull/385 > > .. Owen

Re: [DISCUSS][C++] Add Support For INT/BYTE vector batch

2019-03-29 Thread Xiening Dai
Sounds like a good idea. Should we do the same for the Java reader, Owen? > On Mar 28, 2019, at 8:50 PM, Yurui Zhou wrote: > > Hi guys: > > Currently ORC has LongVectorBatch as the only representation for primitive > integer types like boolean, byte, int and long. This is not very beneficial >

Re: Questions about C++ interface

2018-11-01 Thread Xiening Dai
1) You can find the public headers in c++/include/orc. All the classes and methods have good documentation in the code. You can also take a look at the sample C++ code under tools/src, especially FileContents.cc and FileScan.cc. Both demonstrate the

Re: Orc v2 Ideas

2018-10-08 Thread Xiening Dai
> On Oct 7, 2018, at 6:42 AM, Dain Sundstrom wrote: > > > >> On Oct 6, 2018, at 11:42 AM, Owen O'Malley wrote: >> >> On Mon, Oct 1, 2018 at 3:56 PM Dain Sundstrom wrote: >> >>> >>> Interesting idea. This could help some processors of the data. Also, if >>> the format has this, it would

Re: Orc v2 Ideas

2018-10-01 Thread Xiening Dai
Thanks Dain. My comments are inline. > On Oct 1, 2018, at 11:51 AM, Dain Sundstrom wrote: > > > >> On Sep 28, 2018, at 2:40 PM, Xiening Dai wrote: >> >> Hi all, >> >> While we are working on the new Orc v2 spec, I want to bounce some ideas

Orc v2 Ideas

2018-09-28 Thread Xiening Dai
Hi all, While we are working on the new Orc v2 spec, I want to bounce some ideas off this group. If we can get something concrete, I will open JIRAs to follow up. Some of these ideas were mentioned before in various discussions, but I have put them together in a list so people can comment and pro

Re:Re: [Discussion] Base 128 variable integer encoding is not always good

2018-09-19 Thread Xiening Dai
Send Date: Tue Sep 18 16:08:40 2018 Recipients: Gopal Vijayaraghavan CC: , Xiening Dai Subject: Re: [Discussion] Base 128 variable integer encoding is not always good Gang, As you correctly point out, some columns don't work well with RLE. Unfortunately, without being able to look at the data it i
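
For readers unfamiliar with the encoding named in the subject line, here is a minimal sketch of unsigned base 128 varint encoding: seven payload bits per byte, least significant group first, with the high bit marking continuation. `encodeVarint` is a name chosen for this example and is not taken from the ORC code base.

```cpp
#include <cstdint>
#include <vector>

// Encodes an unsigned value as a base 128 varint.
inline std::vector<uint8_t> encodeVarint(uint64_t value) {
  std::vector<uint8_t> out;
  while (value >= 0x80) {
    // Emit the low 7 bits and set the continuation bit.
    out.push_back(static_cast<uint8_t>((value & 0x7f) | 0x80));
    value >>= 7;
  }
  // Final byte: continuation bit clear.
  out.push_back(static_cast<uint8_t>(value));
  return out;
}
```

Small values fit in one byte, but a value that needs 33 bits takes five bytes, more than a fixed 4-byte field would need for values just below it. That size growth is one way the encoding can work against columns of large integers.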

Re: Arrow Support of Orc

2018-07-05 Thread Xiening Dai
compatibility.) > > Have you tried benchmarking and profiling the current adapters to see where > the bottlenecks are? > > .. Owen > > On Wed, Jul 4, 2018 at 1:41 AM, Xiening Dai wrote: > >> Hi all, >> >> Not sure if this has been brought up before - do we have p

Arrow Support of Orc

2018-07-03 Thread Xiening Dai
Hi all, Not sure if this has been brought up before - do we have a plan to support Apache Arrow? Given its popularity and momentum recently, we might consider supporting the Arrow format in the Orc reader and writer. There’s an adapter for Orc C++ reader - https://github.com/apache/arrow/tree/master/cpp

Apache Orc doc links are broken

2018-05-18 Thread Xiening Dai
See https://orc.apache.org/docs/. All the links point to localhost and cannot be opened. Known issue?

Re: Zstd decoder support

2018-05-17 Thread Xiening Dai
Hi Dain, Do you have a rough timeline for when the Java zstd compressor will be available? Thanks. > On May 7, 2018, at 12:34 PM, Dain Sundstrom wrote: > > The fixes are released in v0.11 > > -dain > >> On May 6, 2018, at 9:36 PM, Xiening Dai wrote: >>

[jira] [Created] (ORC-363) Enable zstd decompression in ORC Java reader

2018-05-11 Thread Xiening Dai (JIRA)
Xiening Dai created ORC-363: --- Summary: Enable zstd decompression in ORC Java reader Key: ORC-363 URL: https://issues.apache.org/jira/browse/ORC-363 Project: ORC Issue Type: Bug

Re: [VOTE] Should we release ORC 1.5.0rc0?

2018-05-10 Thread Xiening Dai
+1 - build clean (make package; make test-out) - checksum and signatures are good. - rat checks passed (mvn apache-rat:check) > On May 10, 2018, at 12:03 PM, Deepak Majeti wrote: > > +1 > > - built from tar > - checked checksum and signature > - ran unit tests > - ran rat checks > > On Mon, M

Re: [VOTE] Should we release ORC 1.4.4rc0?

2018-05-09 Thread Xiening Dai
Hi Owen, What’s the release validation process? Gang and I could help with this. > On May 7, 2018, at 11:47 AM, Owen O'Malley wrote: > > Although I just started the ORC 1.5.0 vote, we have some users that want a > bug fix release for the ORC 1.4 line. > > Should we release the following artifac

Re: Zstd decoder support

2018-05-06 Thread Xiening Dai
Thanks for the clarification. It makes sense to wait for your fixes. Thx. > On May 5, 2018, at 1:04 PM, Dain Sundstrom wrote: > > >> On May 5, 2018, at 11:46 AM, Xiening Dai wrote: >> >>>>> BTW we are about to do a release that fixes a bug with zstd.

Re: Zstd decoder support

2018-05-05 Thread Xiening Dai
e compressor soon as we need it for our production >>> systems. >>> >>> BTW we are about to do a release that fixes a bug with zstd. >>> >>> -dain >>> >>>> On May 4, 2018, at 11:19 AM, Xiening Dai wrote: >>>> >>>> H

Zstd decoder support

2018-05-04 Thread Xiening Dai
Hi all, I think the major reason that we don’t support the zstd compressor today is that there’s currently no native Java library. But I do see a Java decompressor in the Presto code base - https://github.com/prestodb/presto/blob/8f4e5bb9340890f01291ee1b777a1b2b921a90c4/presto-orc/src/main/java/com/fac

Question about incomingMask in ColumnReader::next

2018-04-10 Thread Xiening Dai
Hi all, In ColumnReader.cc lines 108 - 113, there’s an incomingMask parameter that’s used to mask some of the rows in the batch when it’s specified. I am not sure why we would allow the caller to specify a mask. If some of the rows are not desired, the caller can always filte

Re: ORC double encoding optimization proposal

2018-03-28 Thread Xiening Dai
So we could modify my #2 proposal to be sensitive to rle and compression chunks. If at the end of the row group, we wait until the rle and compression chunks close and interleave the streams. That means that for a column with three streams and two row groups, we could do something like: I think y

Re: ORC double encoding optimization proposal

2018-03-26 Thread Xiening Dai
This is very similar to what we do with our modified reader - we have a small stream threshold and save all small streams right after index streams, and use a single IO to buffer them when loading a stripe. We thought about sorting streams by size, but didn’t want to break apart streams that belong
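
The small-stream grouping described above could look roughly like the following. This is a minimal sketch under stated assumptions: `StreamSketch` and `groupSmallStreams` are hypothetical names, the threshold value is illustrative, and the modified reader mentioned in the message is not public in this thread.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical description of one stream in a stripe.
struct StreamSketch {
  int id;
  uint64_t length;  // size in bytes
};

// Moves streams shorter than smallThreshold to the front, keeping relative
// order within each group, and returns the total byte count of that prefix.
// A reader could then buffer the whole prefix with one IO.
inline uint64_t groupSmallStreams(std::vector<StreamSketch>& streams,
                                  uint64_t smallThreshold) {
  auto firstLarge = std::stable_partition(
      streams.begin(), streams.end(),
      [smallThreshold](const StreamSketch& s) {
        return s.length < smallThreshold;
      });
  uint64_t prefixBytes = 0;
  for (auto it = streams.begin(); it != firstLarge; ++it) {
    prefixBytes += it->length;
  }
  return prefixBytes;
}
```

A stable partition is used here rather than a sort by size, so the original ordering of streams within each group is preserved.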

Re: ORC double encoding optimization proposal

2018-03-26 Thread Xiening Dai
Where does the 2x IO drop come from? Based on Cheng Xu’s data, Split + Zstd has ~15% improvement over PlainV2 + Zstd in terms of the file size. If I understand correctly, the total number of IO reads is almost the same, but Split will need an additional seek for each read. The random IOPS woul

Re: ORC double encoding optimization proposal

2018-03-25 Thread Xiening Dai
Hi Gopal, The ORC spec doesn’t guarantee that streams belonging to the same column are stored together. Even if that’s guaranteed, there are reasons why we cannot read adjacent streams with one single IO - 1. Streams can be large. Reading the whole stream(s) will add unnecessary memory pressure. 2. Under

Re: ORC double encoding optimization proposal

2018-03-25 Thread Xiening Dai
Interesting discussion. Thanks Gopal and Cheng. One of the drawbacks I can see with Split is the fragmented IO pattern. Since Split creates two separate streams, reading one data batch will need an additional seek in order to reconstruct the column data. This creates an extra burden for clusters w

Re: RLEv1 versus RLEv2

2018-02-01 Thread Xiening Dai
prepared in 2015 after which more perf > improvements went in. I don't think anyone benchmarked RLEv1 vs RLEv2 after > the initial perf tests. > > > Thanks and Regards, > Prasanth Jayachandran > > > On Wed, Jan 31, 2018 at 8:27 PM, Xiening Dai wrote: > >

RLEv1 versus RLEv2

2018-01-31 Thread Xiening Dai
Hi, I am evaluating whether adding RLEv2 support to the Orc C++ writer should be a high priority. I wonder if anyone has any performance data to share regarding RLEv2. I believe some related tests and evaluation must have been done when RLEv2 was introduced. I would appreciate it if someone can share

[jira] [Created] (ORC-290) [C++] Update Readme to include C++ writer info

2018-01-18 Thread Xiening Dai (JIRA)
Xiening Dai created ORC-290: --- Summary: [C++] Update Readme to include C++ writer info Key: ORC-290 URL: https://issues.apache.org/jira/browse/ORC-290 Project: ORC Issue Type: Bug

Re: ORC magic

2017-12-15 Thread Xiening Dai
Hi Deepak, The ORC C++ writer does write the “ORC” magic at the beginning of the file. But the reader does not verify it when opening the file (same for the Java reader as far as I can tell). But there’s probably a reason for that - since the reader already verifies the postscript at the file tail it’s not necessary to
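
A header check of the kind discussed above would be small. The following is a minimal sketch, assuming the first bytes of the file are already in memory; `startsWithOrcMagic` is a hypothetical helper and not a function in the ORC C++ reader.

```cpp
#include <cstddef>
#include <cstring>

// Returns true if the buffer begins with the three bytes 'O', 'R', 'C'.
inline bool startsWithOrcMagic(const char* data, std::size_t size) {
  static const char kMagic[] = {'O', 'R', 'C'};
  return size >= sizeof(kMagic) &&
         std::memcmp(data, kMagic, sizeof(kMagic)) == 0;
}
```

A caller would read the first three bytes of the file and reject it early if the check fails, in addition to the postscript verification at the file tail.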

Re: ORC magic

2017-12-14 Thread Xiening Dai
It looks like our reader implementation (both Java and C++) doesn’t verify that the file begins with “ORC”. > On Dec 14, 2017, at 7:16 PM, Deepak Majeti wrote: > > Hi Dain, > > The ORC spec requires that a file start with "ORC". > > From https://orc.apache.org/docs/file-tail.html > > "The file is bro

[jira] [Created] (ORC-262) Support async prefetch in Orc reader

2017-11-07 Thread Xiening Dai (JIRA)
Xiening Dai created ORC-262: --- Summary: Support async prefetch in Orc reader Key: ORC-262 URL: https://issues.apache.org/jira/browse/ORC-262 Project: ORC Issue Type: Improvement

[jira] [Created] (ORC-226) Support getWriterId in c++ reader interface

2017-08-09 Thread Xiening Dai (JIRA)
Xiening Dai created ORC-226: --- Summary: Support getWriterId in c++ reader interface Key: ORC-226 URL: https://issues.apache.org/jira/browse/ORC-226 Project: ORC Issue Type: Sub-task

[jira] [Created] (ORC-205) Include writer timezone in stripe footer

2017-06-21 Thread Xiening Dai (JIRA)
Xiening Dai created ORC-205: --- Summary: Include writer timezone in stripe footer Key: ORC-205 URL: https://issues.apache.org/jira/browse/ORC-205 Project: ORC Issue Type: Sub-task

[jira] [Created] (ORC-191) RLE v1 encoder

2017-05-16 Thread Xiening Dai (JIRA)
Xiening Dai created ORC-191: --- Summary: RLE v1 encoder Key: ORC-191 URL: https://issues.apache.org/jira/browse/ORC-191 Project: ORC Issue Type: Sub-task Reporter: Xiening Dai

[jira] [Created] (ORC-192) Zlib compression stream

2017-05-16 Thread Xiening Dai (JIRA)
Xiening Dai created ORC-192: --- Summary: Zlib compression stream Key: ORC-192 URL: https://issues.apache.org/jira/browse/ORC-192 Project: ORC Issue Type: Sub-task Reporter: Xiening Dai

ORC C++ Writer

2016-03-08 Thread Xiening Dai
Hi, Is there any plan to add a C++ ORC writer? We are considering using the C++ library for reading and writing ORC files. But currently only a reader is available. Thanks.

ORC writer in C++?

2016-02-22 Thread Xiening Dai
Hi all, Is there a C++ ORC file writer available today? I can only see a column printer but no file writer. I am using https://git-wip-us.apache.org/repos/asf/orc.git. Thanks!