Re: The Orc magic string

2019-06-15 Thread Dain Sundstrom
Is this expected behavior of ORC acid writers? If so, is it documented somewhere? -dain Dain Sundstrom Co-founder @ Presto Software Foundation, Co-creator of Presto (https://prestosql.io) > On Jun 14, 2019, at 6:17 PM, Owen O'Malley wrote: > > The hive acid format uses a

Re: Pluggable index for ORC

2019-06-07 Thread Dain Sundstrom
It would be nice if there were some reserved space in the enums for experiments like this. -dain Dain Sundstrom Co-founder @ Presto Software Foundation, Co-creator of Presto (https://prestosql.io) > On Jun 6, 2019, at 4:14 AM, Praveen Krishna > wrote: >

Re: WriterOptions.writerVersion(version)?

2019-03-02 Thread Dain Sundstrom
After a bit more investigation, it looks like this was a regression only present in Hive 2.0 to 2.2. Dain Sundstrom Co-founder @ Presto Software Foundation, Co-creator of Presto (https://prestosql.io) > On Mar 1, 2019, at 5:58 PM, Dain Sundstrom wrote: > > Thanks Owen,

Re: WriterOptions.writerVersion(version)?

2019-03-01 Thread Dain Sundstrom
rs ago. ---- Dain Sundstrom Co-founder @ Presto Software Foundation, Co-creator of Presto (https://prestosql.io) > On Mar 1, 2019, at 3:19 PM, Owen O'Malley wrote: > > The goal of WriterVersion is to record changes to the writer software so > that the readers can cope with unknown bugs.

Expected type coercions

2019-01-15 Thread Dain Sundstrom
Is there any documentation on which type coercions should be supported when the partition schema does not match the file schema? -dain

Re: Orc v2 Ideas

2018-10-09 Thread Dain Sundstrom
> On Oct 8, 2018, at 5:19 PM, Xiening Dai wrote: >> On Oct 7, 2018, at 6:42 AM, Dain Sundstrom wrote: >>> On Oct 6, 2018, at 11:42 AM, Owen O'Malley wrote: >>> >>> On Mon, Oct 1, 2018 at 3:56 PM Dain Sundstrom wrote: >>>>>> * Breakin

Re: Orc v2 Ideas

2018-10-06 Thread Dain Sundstrom
> On Oct 6, 2018, at 11:42 AM, Owen O'Malley wrote: > > On Mon, Oct 1, 2018 at 3:56 PM Dain Sundstrom wrote: > >> >> Interesting idea. This could help some processors of the data. Also, if >> the format has this, it would be good to support "clustered

Re: Orc v2 Ideas

2018-10-01 Thread Dain Sundstrom
can be easily turned into a >> key range. For example “WHERE id > 0 and id <= 100” will be translated into >> range (0, 100], and this key range can be passed down all the way to the >> Orc reader. Then we only need to load the corresponding row groups that >> covers thi
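A minimal sketch of the key-range idea quoted above, assuming each row group carries min/max statistics for the `id` column. The names `RowGroupStats`, `overlaps`, and `select_row_groups` are hypothetical and are not part of any ORC reader API; the sketch only illustrates how a range such as (0, 100] could be used to skip row groups.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group statistics for one integer column."""
    index: int
    min_id: int
    max_id: int


def overlaps(stats, low, high):
    """True if the row group may hold a value in the range (low, high]."""
    # low is exclusive, high is inclusive, matching "id > low AND id <= high".
    return stats.max_id > low and stats.min_id <= high


def select_row_groups(groups, low, high):
    """Return the indexes of the row groups that must be loaded."""
    return [g.index for g in groups if overlaps(g, low, high)]


groups = [
    RowGroupStats(0, -50, 0),
    RowGroupStats(1, 1, 60),
    RowGroupStats(2, 61, 100),
    RowGroupStats(3, 101, 200),
]
selected = select_row_groups(groups, 0, 100)
print(selected)  # [1, 2]
```

Row group 0 is skipped because its maximum equals the exclusive lower bound, and row group 3 is skipped because its minimum lies above the inclusive upper bound.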

Re: Orc v2 Ideas

2018-10-01 Thread Dain Sundstrom
lock compression and the data encodings, can have dramatic effects on the format. Can we consider limiting the compression to LZ4 and ZSTD (or maybe just ZSTD), and then design encodings that play well with them? Also, ZSTD can have a pre-trained "dictionary" that might help with specific encodings… Just a thought. -dain

[jira] [Created] (ORC-369) Seek to end of BitFieldReader causes EOFException

2018-05-29 Thread Dain Sundstrom (JIRA)
Dain Sundstrom created ORC-369: -- Summary: Seek to end of BitFieldReader causes EOFException Key: ORC-369 URL: https://issues.apache.org/jira/browse/ORC-369 Project: ORC Issue Type: Bug

Re: Zstd decoder support

2018-05-17 Thread Dain Sundstrom
Our expectation is maybe in a quarter. -dain > On May 17, 2018, at 11:42 AM, Xiening Dai <xndai@live.com> wrote: > > Hi Dain, > > Do you have a rough timeline regarding when the Java zstd compressor will > be available? Thanks. > > >> On May 7,

Re: Zstd decoder support

2018-05-07 Thread Dain Sundstrom
The fixes are released in v0.11 -dain > On May 6, 2018, at 9:36 PM, Xiening Dai <xndai@live.com> wrote: > > Thanks for clarification. It makes sense to wait for your fixes. Thx. > >> On May 5, 2018, at 1:04 PM, Dain Sundstrom <d...@iq80.com> wrote: >>

Re: Zstd decoder support

2018-05-05 Thread Dain Sundstrom
tly. We missed this one because the default native implementation was not adding checksums so the code wasn’t actually being tested. -dain

Re: Zstd decoder support

2018-05-04 Thread Dain Sundstrom
0.11 is released. -dain Sent from my iPhone > On May 4, 2018, at 1:41 PM, Owen O'Malley <owen.omal...@gmail.com> wrote: > > I just upgraded ORC to use aircompressor 0.10. I assume we'll want to move > to 0.11 before we use zstd? > > .. Owen > >> On Fri, May 4,

Re: Zstd decoder support

2018-05-04 Thread Dain Sundstrom
. -dain > On May 4, 2018, at 11:19 AM, Xiening Dai <xndai@live.com> wrote: > > Hi all, > > I think the major reason that we don’t support zstd compressor today is that > there’s no native java library currently. But I do see a java decompressor in > pr

Re: Alternatives to JMH for the benchmarking code

2018-04-20 Thread Dain Sundstrom
I’m pretty sure that JMH has the classpath exception, just like the JVM, so you can link to it, but can't ship it. -dain > On Apr 20, 2018, at 9:09 AM, Owen O'Malley <owen.omal...@gmail.com> wrote: > > On Thu, Apr 19, 2018 at 1:20 PM, Dain Sundstrom <d...@iq80.com> wrote

Re: Alternatives to JMH for the benchmarking code

2018-04-19 Thread Dain Sundstrom
I don’t think there is anything like JMH, or any team that understands Java microbenchmarking as well. Have you tried asking them to open source the APIs under a better license so you can code against them? -dain > On Apr 18, 2018, at 8:52 AM, Owen O'Malley <owen.omal...@gmail.com>

[GitHub] orc pull request #245: ORC-161: Proposal for new decimal encodings and stati...

2018-04-12 Thread dain
Github user dain commented on a diff in the pull request: https://github.com/apache/orc/pull/245#discussion_r181164570 --- Diff: site/_docs/encodings.md --- @@ -109,10 +109,20 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE Decimal was introduced in Hive 0.11

Re: ORC double encoding optimization proposal

2018-03-31 Thread Dain Sundstrom
e IO optimizer ends up reading the full streams anyway (e.g., a seek on a disk is about as expensive as reading ~1MiB of data so you coalesce reads with a gap less than ~1MiB to avoid the extra seek). -dain
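The read-coalescing rule mentioned above can be sketched as follows. This is an illustration only, assuming reads are described as `(offset, length)` pairs and that the ~1 MiB threshold is a tunable parameter; `coalesce_reads` is a hypothetical name, not a function from ORC or Presto.

```python
MIB = 1024 * 1024


def coalesce_reads(ranges, max_gap=MIB):
    """Merge (offset, length) reads whose gap is smaller than max_gap bytes."""
    merged = []  # list of [start, end) pairs, kept sorted by start
    for offset, length in sorted(ranges):
        end = offset + length
        if merged and offset - merged[-1][1] < max_gap:
            # Reading through the gap is assumed cheaper than an extra seek.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


reads = [(0, 100), (500_000, 100), (5 * MIB, 100)]
print(coalesce_reads(reads))  # [(0, 500100), (5242880, 100)]
```

The first two reads are about 0.5 MB apart, so they collapse into one larger read; the third is more than 1 MiB away and stays separate.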

Re: ORC double encoding optimization proposal

2018-03-26 Thread Dain Sundstrom
) number columns mixed with (large) string columns. If you only want the numbers, you end up doing a lot of IOs (because of the large string columns in the middle), and with this model you have a higher chance of getting a shared IO. -dain > On Mar 26, 2018, at 4:23 PM, Owen O'Malley <owe

ORC magic

2017-12-14 Thread Dain Sundstrom
Does the ORC spec require that a file start with “ORC”? -dain

[GitHub] orc issue #169: [WIP] ORC-203 Modify the StringStatistics to trim the minimu...

2017-09-21 Thread dain
Github user dain commented on the issue: https://github.com/apache/orc/pull/169 @xndai if the min or max happens to be a multi megabyte value it can be really expensive for the reader. Additionally, for filtering the first few bytes are the most valuable (they establish the range). ---
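To illustrate why trimmed statistics still work for filtering, here is a rough sketch of shortening a string minimum and maximum while keeping them valid bounds. It is not the ORC-203 implementation: `trim_min` and `trim_max` are hypothetical helpers, the limit is counted in code points rather than bytes, and ordering is assumed to be by Unicode code point (which agrees with UTF-8 byte order).

```python
def trim_min(value, limit):
    """Shorten a minimum; a prefix never sorts after the full string."""
    return value[:limit]


def trim_max(value, limit):
    """Shorten a maximum to a string that still sorts at or after it."""
    if len(value) <= limit:
        return value
    prefix = value[:limit]
    # Bump the last code point that can be bumped, dropping what follows,
    # so the result sorts after every string that starts with the prefix.
    for i in range(len(prefix) - 1, -1, -1):
        nxt = ord(prefix[i]) + 1
        if nxt == 0xD800:
            nxt = 0xE000  # skip the surrogate range
        if nxt <= 0x10FFFF:
            return prefix[:i] + chr(nxt)
    return value  # nothing could be bumped; keep the original bound


print(trim_min("applesauce", 3))  # app
print(trim_max("applesauce", 3))  # apq
```

The trimmed minimum is slightly smaller than the real one and the trimmed maximum slightly larger, so a reader that uses them for pruning may keep a few extra stripes but never drops a matching one.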

[GitHub] orc issue #163: ORC-162. Handle 0 byte files as empty ORC files.

2017-08-29 Thread dain
Github user dain commented on the issue: https://github.com/apache/orc/pull/163 Also this is a backwards incompatible change, so we would, at the very least, need to do the trick where it is disabled by default in the writer until the reader is rolled out everywhere.

[GitHub] orc issue #163: ORC-162. Handle 0 byte files as empty ORC files.

2017-08-29 Thread dain
Github user dain commented on the issue: https://github.com/apache/orc/pull/163 We were considering doing this internally and then we ran into a production bug where files got truncated to zero bytes. Since empty files are illegal we could find all of the affected partitions easily

Re: [DISCUSS] ORC 2.0

2017-08-04 Thread Dain Sundstrom
> unusable. Before throwing that switch, he would get none of the benefits >> of ORC 2.0. Is this summary correct? >> > > Yes, exactly. I think the important part is not to change the APIs so tools can be updated by just upgrading the dep. -dain

Re: [DISCUSS] ORC 2.0

2017-08-04 Thread Dain Sundstrom
encodings, we should pick encodings that play well with vectorization which is coming in Java 10 (Java 9 also has vastly improved auto vectorization). -dain > On Aug 4, 2017, at 9:29 AM, Owen O'Malley <owen.omal...@gmail.com> wrote: > > All, > We've started the process of upda

[GitHub] orc pull request #132: ORC-202. Add writer implementation enum to file forma...

2017-06-17 Thread dain
Github user dain commented on a diff in the pull request: https://github.com/apache/orc/pull/132#discussion_r122580567 --- Diff: java/core/src/java/org/apache/orc/impl/OrcTail.java --- @@ -70,8 +70,11 @@ public long getFileModificationTime() { public

[GitHub] orc pull request #132: ORC-202. Add writer implementation enum to file forma...

2017-06-17 Thread dain
Github user dain commented on a diff in the pull request: https://github.com/apache/orc/pull/132#discussion_r122575288 --- Diff: proto/orc_proto.proto --- @@ -221,15 +227,32 @@ message PostScript { // [0, 12] = Hive 0.12 repeated uint32 version = 4 [packed = true

[GitHub] orc pull request #132: ORC-202. Add writer implementation enum to file forma...

2017-06-17 Thread dain
Github user dain commented on a diff in the pull request: https://github.com/apache/orc/pull/132#discussion_r122575252 --- Diff: java/core/src/java/org/apache/orc/impl/OrcTail.java --- @@ -70,8 +70,11 @@ public long getFileModificationTime() { public

[GitHub] orc pull request #132: ORC-202. Add writer implementation enum to file forma...

2017-06-16 Thread dain
Github user dain commented on a diff in the pull request: https://github.com/apache/orc/pull/132#discussion_r122521442 --- Diff: proto/orc_proto.proto --- @@ -221,15 +227,29 @@ message PostScript { // [0, 12] = Hive 0.12 repeated uint32 version = 4 [packed = true

Documentations issues

2017-06-16 Thread Dain Sundstrom
Recently I have been working on a custom writer for Presto and during this I kept notes on sections of the documentation that might have problems. Some of these may have already been addressed: ## Compression see https://orc.apache.org/docs/compression.html I think the hex sequence for 10

[GitHub] orc pull request #132: ORC-202. Add writer implementation enum to file forma...

2017-06-16 Thread dain
Github user dain commented on a diff in the pull request: https://github.com/apache/orc/pull/132#discussion_r122507019 --- Diff: proto/orc_proto.proto --- @@ -221,15 +227,29 @@ message PostScript { // [0, 12] = Hive 0.12 repeated uint32 version = 4 [packed = true

[GitHub] orc pull request #132: ORC-202. Add writer implementation enum to file forma...

2017-06-16 Thread dain
Github user dain commented on a diff in the pull request: https://github.com/apache/orc/pull/132#discussion_r122506293 --- Diff: java/core/src/java/org/apache/orc/OrcFile.java --- @@ -108,66 +108,118 @@ public int getMinor() { } } + public enum

What did ORC writer v3 fix?

2017-06-06 Thread Dain Sundstrom
On reading HIVE-12055, which is referenced from the docs for writer version 3, I’m not sure what was fixed. Does anyone remember? -dain

Re: String stats requirements?

2017-06-06 Thread Dain Sundstrom
/6072e3aed88d9246e1130abadf3c15a88e975b4e#diff-340d190f994d92658b24aae1edf610b3 Is writer version "1 = HIVE-8732 fixed" after 0.14? If so I can update my reader to detect this. -dain > On Jun 6, 2017, at 3:36 PM, Owen O'Malley <owen.omal...@gmail.com> wrote: > > On Tue, Jun 6, 2017 at 3:02 P

Re: "For dictionary encodings the dictionary is sorted"

2017-06-06 Thread Dain Sundstrom
u want to transition back to Java Strings, we would need to be a bit smarter. -dain > On Jun 6, 2017, at 3:39 PM, Owen O'Malley <owen.omal...@gmail.com> wrote: > > I'm confused. TimestampStatistics uses integers not strings. > > .. Owen > > On Mon, Jun 5, 2017 at 9:53

String stats requirements?

2017-06-06 Thread Dain Sundstrom
at the first surrogate pair, so the value is slightly smaller than the min or larger than the max, and still a valid UTF-8 sequence. Thoughts? -dain

Re: "For dictionary encodings the dictionary is sorted"

2017-06-05 Thread Dain Sundstrom
> On Dec 12, 2016, at 4:48 PM, Dain Sundstrom <d...@iq80.com> wrote: > On Dec 12, 2016, at 4:36 PM, Owen O'Malley <omal...@apache.org> wrote: >>> I think this should also be documented in the statistics section which >> also uses UTF-16 BE, which is at lea

[GitHub] orc issue #78: ORC-128. Add getStatistics to Writer API

2017-01-06 Thread dain
Github user dain commented on the issue: https://github.com/apache/orc/pull/78 Just curious, how do you plan on using this?

[GitHub] orc issue #76: ORC-119. Create an API to separate out layout from the writer...

2017-01-04 Thread dain
Github user dain commented on the issue: https://github.com/apache/orc/pull/76 Can you describe why you want to make this change?

Re: Bloom filter hash broken

2016-09-09 Thread Dain Sundstrom
bloom filters from these files. -dain

Re: Bloom filter hash broken

2016-09-08 Thread Dain Sundstrom
Sounds good to me. Should we add a version field to the BLOOM_FILTER_UTF8 to deal with any future problems? One other thought, in the protobuf definition I think it would be more efficient to have the bitset encoded as a byte[] to avoid the boxed long array. -dain > On Sep 8, 2016, at 3