Re: [Specification] Uniqueness of Data file names and cardinality in a snapshot?

2022-03-04 Thread Micah Kornfield
Hi Dan,

> The paths are not unique across snapshots 1/2, but within each snapshot
> they are.

This is exactly the contract I was trying to determine. It seems like it could potentially be clarified in the specification (again, apologies if I missed it). I opened a PR to try to add language

Re: [Specification] Uniqueness of Data file names and cardinality in a snapshot?

2022-03-04 Thread Daniel Weeks
I think in the situation you're demonstrating, the manifests are separated across two separate snapshots. Here's an example:

create table t1 (s string);
insert into t1 values ('foo'); -- snapshot 0, manifest-list with 1 manifest pointing to file A (ADDED)
insert into t1 values ('bar'); --

Re: [Specification] Uniqueness of Data file names and cardinality in a snapshot?

2022-03-04 Thread Micah Kornfield
Hi Dan,

Thanks for the quick reply.

> For #2, the answer follows mostly because if the answer to #1 holds, then
> yes the pairwise intersection of entries in the manifest files of a given
> snapshot is empty.

Just to be pedantic: even with unique file names, it seems one could construct a

Re: [Specification] Uniqueness of Data file names and cardinality in a snapshot?

2022-03-04 Thread Daniel Weeks
Hey Micah, For #1, I don't believe the spec clearly calls out that all data/delete files must be unique, but the requirements for cleanup would be violated in certain cases if you had the same file referenced in multiple manifests. In practice, the best way to ensure data correctness and metadata
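
The contract discussed in this thread (data file paths are unique within a snapshot, but may repeat across snapshots) can be illustrated with a minimal sketch. The data model here is hypothetical: a snapshot is a plain dict from a manifest name to the file paths it references, and the layout of `snapshot_1` is an assumption made for illustration, not something taken from the messages above or from the Iceberg spec.

```python
from collections import Counter

# Hypothetical model (not an Iceberg API): a snapshot maps a manifest name
# to the list of data file paths that manifest references.
snapshot_0 = {"manifest-1": ["A"]}
# Assumed layout: the later snapshot still references file A and adds file B.
snapshot_1 = {"manifest-1": ["A"], "manifest-2": ["B"]}


def duplicate_paths(snapshot):
    """Return the sorted paths referenced more than once within one snapshot."""
    counts = Counter(path for paths in snapshot.values() for path in paths)
    return sorted(path for path, n in counts.items() if n > 1)


def shared_paths(first, second):
    """Return the sorted paths referenced by both snapshots."""
    paths_first = {p for paths in first.values() for p in paths}
    paths_second = {p for paths in second.values() for p in paths}
    return sorted(paths_first & paths_second)
```

Under these assumptions, `duplicate_paths` returns an empty list for each snapshot, while `shared_paths(snapshot_0, snapshot_1)` is non-empty, which matches "not unique across snapshots, but within each snapshot they are".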

Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-04 Thread Owen O'Malley
At the stripe boundaries, the bytes on disk statistics are accurate. A stripe that is in flight is going to be an estimate, because the dictionaries can't be compressed until the stripe is flushed. The memory usage will be a significant overestimate, because it includes buffers that are

Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-04 Thread Dongjoon Hyun
The following is merged for Apache ORC 1.7.4.

ORC-1123 Add `estimationMemory` method for writer

According to the Apache ORC milestone, it will be released on May 15th.
https://github.com/apache/orc/milestones

Bests,
Dongjoon.

On 2022/03/04 13:11:15 Yiqun Zhang wrote:
> Hi Openinx
>
> Thank

Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-04 Thread Yiqun Zhang
Hi Openinx, Thank you for initiating this discussion. I think we can get the `TypeDescription` from the writer, and from the `TypeDescription` we know the column types and, more precisely, the maximum length of each varchar/char. This will help us to estimate the average width. Also, I agree with your
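
As a rough illustration of the idea in this thread, the sketch below estimates a row width from column types and varchar/char maximum lengths, then adds an in-flight estimate to the bytes already flushed at stripe boundaries. Everything in it is an assumption for illustration: the function names, the `(type_name, max_length)` schema representation, and the per-type byte widths are hypothetical, and it does not call the ORC `TypeDescription` or writer API.

```python
# Illustrative per-value byte widths for fixed-size types. These numbers are
# assumptions for this sketch, not ORC's actual encoded sizes.
FIXED_WIDTHS = {"boolean": 1, "int": 4, "bigint": 8, "float": 4, "double": 8}


def estimate_row_width(columns):
    """Estimate bytes per row from (type_name, max_length) pairs.

    max_length is used only for char/varchar columns. It is an upper bound,
    so the result tends to overestimate the real width.
    """
    total = 0
    for type_name, max_length in columns:
        if type_name in ("char", "varchar"):
            total += max_length
        else:
            total += FIXED_WIDTHS[type_name]
    return total


def estimate_unclosed_size(flushed_bytes, buffered_rows, columns):
    """Add a width-based estimate for buffered rows to the flushed byte count."""
    return flushed_bytes + buffered_rows * estimate_row_width(columns)
```

For example, with columns `[("int", None), ("varchar", 20), ("double", None)]` the assumed row width is 32 bytes, so 10 buffered rows on top of 1000 flushed bytes give an estimate of 1320 bytes. Compression and dictionary encoding are ignored here, so this is only a coarse upper-leaning guess.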