date:20220304

Re: [Specification] Uniqueness of Data file names and cardinality in a snapshot?

2022-03-04 Thread Micah Kornfield

Hi Dan,

> The paths are not unique across snapshots 1/2, but within each snapshot they are.

This is exactly the contract I was trying to determine. It seems like it could potentially be clarified in the specification (again apologies if I missed it). I opened a PR to try to add language arou

Re: [Specification] Uniqueness of Data file names and cardinality in a snapshot?

2022-03-04 Thread Daniel Weeks

I think in the situation you're demonstrating, the manifests are separated across two separate snapshots. Here's an example:

create table t1 (s string);
insert into t1 values ('foo'); -- snapshot 0, manifest-list with 1 manifest pointing to file A (ADDED)
insert into t1 values ('bar'); -- snapsh

Re: [Specification] Uniqueness of Data file names and cardinality in a snapshot?

2022-03-04 Thread Micah Kornfield

Hi Dan,

Thanks for the quick reply.

> For #2, the answer follows mostly because if the answer to #1 holds, then yes the pairwise intersection of entries in the manifest files of a given snapshot is empty.

Just to be pedantic, even with unique file names, it seems one could construct a snap

Re: [Specification] Uniqueness of Data file names and cardinality in a snapshot?

2022-03-04 Thread Daniel Weeks

Hey Micah,

For #1, I don't believe the spec clearly calls out that all data/delete files must be unique, but the requirements for cleanup would be violated in certain cases if you had the same file referenced in multiple manifests. In practice, the best way to ensure data correctness and metadata cons
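
The contract discussed in this thread (a data file path appears at most once across the manifests of a single snapshot, while the same path may legitimately appear in different snapshots) could be checked along the following lines. This is only a minimal sketch: manifests are modeled as plain lists of path strings, and the file names, snapshot contents, and the `duplicate_paths` helper are hypothetical and not part of the Iceberg API.

```python
from collections import Counter


def duplicate_paths(snapshot_manifests):
    """Return the file paths that occur more than once within one snapshot.

    snapshot_manifests: list of manifests, each a list of data file paths
    (a simplified, hypothetical stand-in for real manifest entries).
    """
    counts = Counter(
        path for manifest in snapshot_manifests for path in manifest
    )
    return sorted(path for path, n in counts.items() if n > 1)


# Hypothetical snapshots: the same path may show up in both snapshots,
# but each snapshot on its own references it only once.
snapshot_a = [["file-A.parquet"]]
snapshot_b = [["file-A.parquet"], ["file-B.parquet"]]

# Hypothetical violation: two manifests of one snapshot share a path.
bad_snapshot = [
    ["file-A.parquet", "file-B.parquet"],
    ["file-B.parquet", "file-C.parquet"],
]

print(duplicate_paths(snapshot_a))    # []
print(duplicate_paths(snapshot_b))    # []
print(duplicate_paths(bad_snapshot))  # ['file-B.parquet']
```

An empty result for a snapshot corresponds to the "pairwise intersection of entries in the manifest files is empty" condition mentioned in the thread.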

Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-04 Thread Owen O'Malley

At the stripe boundaries, the bytes-on-disk statistics are accurate. A stripe that is in flight is going to be an estimate, because the dictionaries can't be compressed until the stripe is flushed. The memory usage will be a significant overestimate, because it includes buffers that are allocated

Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-04 Thread Dongjoon Hyun

The following is merged for Apache ORC 1.7.4.

ORC-1123 Add `estimationMemory` method for writer

According to the Apache ORC milestone, it will be released on May 15th.
https://github.com/apache/orc/milestones

Bests,
Dongjoon.

On 2022/03/04 13:11:15 Yiqun Zhang wrote:
> Hi Openinx
>
> Thank yo

Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-04 Thread Yiqun Zhang

Hi Openinx,

Thank you for initiating this discussion. I think we can get the `TypeDescription` from the writer, and from the `TypeDescription` we know the column types and, more precisely, the maximum length of each varchar/char. This will help us to estimate the average width. Also, I agree with your sug
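
The idea in this thread (exact byte counts for flushed stripes, plus a width-based estimate for rows still in flight) might be sketched roughly as follows. Everything here is an assumption made for illustration: the schema is a plain list of `(type_name, max_length)` pairs rather than an ORC `TypeDescription`, the per-type widths in `FIXED_WIDTHS` are invented, and varchar/char columns are simply charged their declared maximum length as an upper bound. It is not how ORC or Iceberg accounts for size.

```python
# Hypothetical uncompressed widths in bytes per value (illustrative only).
FIXED_WIDTHS = {"boolean": 1, "int": 4, "bigint": 8, "double": 8}


def avg_row_width(schema):
    """Estimate the width of one row from (type_name, max_length) pairs."""
    total = 0
    for type_name, max_length in schema:
        if type_name in ("char", "varchar"):
            # Upper bound: assume every value uses the declared maximum.
            total += max_length
        else:
            total += FIXED_WIDTHS[type_name]
    return total


def estimate_bytes(flushed_stripe_bytes, in_flight_rows, schema):
    """Exact bytes for flushed stripes plus an estimate for buffered rows."""
    return sum(flushed_stripe_bytes) + in_flight_rows * avg_row_width(schema)


# Hypothetical schema: one bigint column and one varchar(20) column.
schema = [("bigint", None), ("varchar", 20)]

print(avg_row_width(schema))                     # 28
print(estimate_bytes([1000, 1200], 10, schema))  # 2480
```

Because the in-flight part ignores compression and dictionary encoding, a figure computed this way would tend to overshoot the final on-disk size.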


7 matches
