[jira] [Created] (ORC-629) PPD: Floating point NaN is not transitive across comparisons

2020-05-06 Thread Gopal Vijayaraghavan (Jira)
Gopal Vijayaraghavan created ORC-629:
Summary: PPD: Floating point NaN is not transitive across comparisons
Key: ORC-629
URL: https://issues.apache.org/jira/browse/ORC-629
Project: ORC
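The issue body is not shown in this listing, so the following is only a minimal sketch of the general IEEE 754 behaviour the summary refers to, not a reproduction of the ORC bug. It assumes a hypothetical statistics collector (`naive_min`) that tracks a minimum using only `<`, and shows that the result depends on where the NaN appears.

```python
import math

nan = float("nan")

# Every ordered comparison involving NaN is false, so
# "not (a < b)" does not imply "a >= b".
comparisons = [nan < 1.0, nan > 1.0, nan >= 1.0, nan == nan]


def naive_min(values):
    """Hypothetical collector: track a minimum using only '<'."""
    result = values[0]
    for v in values[1:]:
        if v < result:
            result = v
    return result


# NaN first: nothing compares below it, so it sticks as the "minimum".
min_nan_first = naive_min([nan, 1.0, 2.0])

# NaN later: it never compares below the current minimum, so it is ignored.
min_nan_later = naive_min([1.0, nan, 2.0])

print(comparisons, min_nan_first, min_nan_later)
```

A min/max pair collected this way cannot be trusted for predicate pushdown unless NaN is handled explicitly, for example by recording its presence separately.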

[jira] [Created] (ORC-570) FS: ReaderOptions.filesystem should also accept a lazy Supplier

2019-11-14 Thread Gopal Vijayaraghavan (Jira)
Gopal Vijayaraghavan created ORC-570:
Summary: FS: ReaderOptions.filesystem should also accept a lazy Supplier
Key: ORC-570
URL: https://issues.apache.org/jira/browse/ORC-570
Project: ORC

Re: How to make ORC use libz.so instead of libzip.so

2019-02-07 Thread Gopal Vijayaraghavan
> We are conducting a project involving replacing (Linux) system's libz.so with our own hardware based implementation, but this requires us to replace libzip.so with our own so that small zip processing doesn't go through hardware, as hardware actually cannot process

Re: Orc v2 Ideas

2018-10-09 Thread Gopal Vijayaraghavan
> How small are you trying to make the stripes? I ask because all of the above should be small, so if they are dominating, I would assume the stripe is tiny or the compression really worked well.
I'm not in favour of stripelets for seek reasons, because reading a single column from a

Re: Orc v2 Ideas

2018-10-09 Thread Gopal Vijayaraghavan
> Zstd with particular settings doesn’t work well on one particular non-public dataset after it is encoded by RLE.
> I’ve suggested that you try tuning the zstd compression to find a set of parameters that work well with RLE. Take a look at how we tune the zlib compression based

Re: [VOTE] Should we release ORC 1.5.3rc0?

2018-09-20 Thread Gopal Vijayaraghavan
Hi,
+1 - verified keys, signature, rebuilt Hive master against this build & ran a few queries on LLAP.
Cheers, Gopal
On 9/20/18, 4:26 PM, "Owen O'Malley" wrote:
All, Should we release the following artifacts as ORC 1.5.3? tar: http://home.apache.org/~omalley/orc-1.5.3/

Re: [Discussion] Base 128 variable integer encoding is not always good

2018-09-18 Thread Gopal Vijayaraghavan
Hi,
> From above observation, we find that it is better to disable LEB128 encoding while zstd is used.
You can enable file size optimizations (automatically recommend better layouts for compression) when "orc.encoding.strategy"="COMPRESSION"
There are a bunch of bitpacking loops that's
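For readers unfamiliar with the encoding under discussion, here is a minimal sketch of unsigned base-128 (LEB128) variable-length integers. The function names are illustrative and are not ORC's API; the sketch covers unsigned values only and says nothing about how the encoding interacts with zstd.

```python
def encode_uvarint(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128 (7 data bits per byte)."""
    if value < 0:
        raise ValueError("unsigned varint cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F          # low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)         # final byte has the high bit clear
            return bytes(out)


def decode_uvarint(data: bytes) -> int:
    """Decode one unsigned LEB128 integer from the start of data."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result
        shift += 7
    raise ValueError("truncated varint")


encoded = encode_uvarint(300)
print(encoded.hex(), decode_uvarint(encoded))
```

Values below 128 take one byte and larger values grow by one byte per additional 7 bits, so the output length varies with magnitude.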

Re: [VOTE] Should we release ORC 1.5.2rc0?

2018-06-25 Thread Gopal Vijayaraghavan
Verified signatures against dist KEYS, checksums. Built Hive3.0 against 1.5.2 & everything looks good.
+1 binding.
Cheers, Gopal
On 6/25/18, 4:43 PM, "Prasanth Jayachandran" wrote:
Oops. My bad. Here is the correct link http://home.apache.org/~prasanthj/orc-1.5.2rc0/

Re: [VOTE] Should we release ORC 1.5.0rc0?

2018-05-14 Thread Gopal Vijayaraghavan
Hi,
+1 Package builds clean & tested against HIVE-19465.
Cheers, Gopal
On 5/14/18, 9:54 AM, "Owen O'Malley" wrote:
*Ping* I need one more PMC vote, please. :)
On Thu, May 10, 2018 at 3:18 PM, Xiening Dai wrote: >

Re: [Proposal] New decimal encoding

2018-04-10 Thread Gopal Vijayaraghavan
Hi, I agree with your analysis about Decimals. Something similar has already gone into patch-available previously, but held back https://issues.apache.org/jira/browse/ORC-209 This is somewhat stuck behind the Vector type system evolving support for this

Re: ORC double encoding optimization proposal

2018-03-26 Thread Gopal Vijayaraghavan
> the bad thing is that we still have TWO encodings to discuss.
Two is exactly what we need, not five - from the existing ORC configs hive.exec.orc.encoding.strategy=[SPEED, COMPRESSION]; FLIP8 was my original suggestion to Teddy from the byteuniq UDF runs, though the regressions in

Re: ORC double encoding optimization proposal

2018-03-26 Thread Gopal Vijayaraghavan
> 2. Under seek or predicate pushdown scenario, there’s no need to load the entire stream.
Yes, that is a valid scenario where the reader reads partial-streams & causes random IO. The current double encoding is actually 2 streams today & will continue to use 2 streams for the FLIP

Re: ORC double encoding optimization proposal

2018-03-25 Thread Gopal Vijayaraghavan
Hi,
> Since Split creates two separated streams, reading one data batch will need an additional seek in order to reconstruct the column data
If you are seeing a seek like that, we've messed up something else higher up in the pipeline & that can be fixed. ORC columnar reads only do random

Re: ORC double encoding optimization proposal

2018-03-19 Thread Gopal Vijayaraghavan
> existing work [1] from Teddy Choi and Owen O'Malley with some new compression codec (e.g. ZSTD and Brotli), we proposed to prompt FLIP as the default encoding for ORC double type to move this feature forwards.
Since we're discussing these, I'm going to summarize my existing notes on this,

Re: Thoughts on Acid reader

2017-09-14 Thread Gopal Vijayaraghavan
> For performance reasons, you prefer the second option that I rejected where users give a file and the system finds the deletes from there. I can buy that.
That's simpler at least to understand and debug, the logs from ORC alone are enough to find consistency issues. The rest of the

Re: [DISCUSS] ORC 2.0

2017-08-11 Thread Gopal Vijayaraghavan
Hi,
> My intention is that we can iterate on the UNSTABLE-PRE-2.0 format without cross-version compatibility. It will only be used for developer testing.
Sounds good - I tested Hive can communicate this to ORC correctly.
set hive.exec.orc.write.format="UNSTABLE-PRE-2.0";
offers a very

Re: String stats requirements?

2017-06-06 Thread Gopal Vijayaraghavan
> I agree that we want to be able to trim the values. I've seen cases where the String is huge (~100k) and makes the StringStatistics huge. I'd propose that we do something like:
The only concrete consumer of this data outside of ORC readers is probably partial scan computation of

Re: ORC Stripe Skip Using Stripe Level Index

2017-01-24 Thread Gopal Vijayaraghavan
> I can see that row indices are being used to select only rowgroups that satisfy a search predicate in …
> But, I cannot find where and if the stripe level indices are being used?
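As a rough illustration of what the question describes, selecting only the row groups whose statistics can satisfy a predicate, here is a minimal sketch. `RowGroupStats` and `groups_for_equals` are hypothetical names invented for this example and do not correspond to ORC's reader classes; the same min/max test could in principle be applied at any level that carries such statistics.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group min/max statistics for one integer column."""
    index: int
    minimum: int
    maximum: int


def groups_for_equals(stats, target):
    """Return indices of row groups whose [min, max] range could contain target."""
    return [s.index for s in stats if s.minimum <= target <= s.maximum]


stats = [
    RowGroupStats(0, 1, 100),
    RowGroupStats(1, 101, 200),
    RowGroupStats(2, 150, 300),
]

# Only groups 1 and 2 can contain 175; group 0 is skipped without being read.
selected = groups_for_equals(stats, 175)
print(selected)
```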