Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-07 Thread OpenInx
Thanks Dongjoon & Yiqun for the quick PR adding the `estimateMemory` API. Thanks also to Yiqun & Owen for your points; I think you are right. So a more accurate estimation method may be to multiply batch.size by the average width of the data type, and then multiply that by the compression rate, w
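A minimal sketch of the arithmetic described in this snippet, assuming a fixed table of average value widths per type and a caller-supplied compression rate. The names `AVG_WIDTH_BYTES` and `estimate_buffered_bytes` and the width values are hypothetical illustrations, not part of the ORC or Iceberg APIs.

```python
# Hypothetical average widths in bytes per value; real widths depend on the data.
AVG_WIDTH_BYTES = {"int": 4, "long": 8, "double": 8, "string": 32}


def estimate_buffered_bytes(batch_size, column_types, compression_rate):
    """Estimate bytes for rows still buffered in an unclosed writer.

    batch_size       -- number of buffered rows (batch.size in the message)
    column_types     -- type name for each column in the schema
    compression_rate -- assumed compressed/uncompressed ratio, e.g. 0.5
    """
    # Average width of one row is the sum of the per-column average widths.
    row_width = sum(AVG_WIDTH_BYTES[t] for t in column_types)
    # rows * average row width * compression rate
    return int(batch_size * row_width * compression_rate)


estimate = estimate_buffered_bytes(1024, ["int", "long", "string"], 0.5)
print(estimate)
```

With these assumed widths, a row is 4 + 8 + 32 = 44 bytes, so 1024 buffered rows at a 0.5 compression rate come to 22528 bytes.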

Re: [External] Re: Continuing the Secondary Index Discussion

2022-03-07 Thread Piotr Findeisen
Hi Zaicheng, thanks for following up on this. I'm certainly interested. The proposed time doesn't work for me though, I'm in the CET time zone. Best, PF On Sat, Mar 5, 2022 at 9:33 AM Zaicheng Wang wrote: > Hi dev folks, > > As discussed in the sync >

RE: Getting last modified timestamp/other stats per partition

2022-03-07 Thread Mayur Srivastava
A few follow-up questions on getting the last modified time for each partition: 1. If we want to use snapshots, does this mean we will have to maintain the full history of snapshots? E.g. if we partition by method=‘day’ and write once a day for a few years, we will end up maintaining 1000s of

Re: Getting last modified timestamp/other stats per partition

2022-03-07 Thread Ryan Blue
Mayur, This is one of the reasons why we want to introduce tagging in the format. That will allow you to tag snapshots that you want to keep and expire intermediate versions. In general, there is some cost to keeping thousands of snapshots. Those are held in the metadata file that gets written ea

Re: Getting last modified timestamp/other stats per partition

2022-03-07 Thread Szehon Ho
> > 2. How can we distinguish between snapshots where new data was > added vs snapshots where compaction was done? > Yeah, to answer the second question, I forgot to mention there is a field on the Manifest Entries table called 'status' that you can filter on. It might not be documented as it's
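A rough sketch of filtering on the 'status' field mentioned here, assuming the manifest entry rows have already been read into plain dictionaries with `snapshot_id` and `status` keys. The status codes (0 = EXISTING, 1 = ADDED, 2 = DELETED) follow the Iceberg table spec; the helper names and sample rows are hypothetical.

```python
from collections import defaultdict

# Manifest entry status codes as defined in the Iceberg table spec.
STATUS_NAMES = {0: "EXISTING", 1: "ADDED", 2: "DELETED"}


def entries_with_status(entries, status_name):
    """Keep only the entries whose status matches the given name."""
    return [e for e in entries if STATUS_NAMES[e["status"]] == status_name]


def statuses_by_snapshot(entries):
    """Group the set of status names seen under each snapshot id."""
    grouped = defaultdict(set)
    for entry in entries:
        grouped[entry["snapshot_id"]].add(STATUS_NAMES[entry["status"]])
    return dict(grouped)


# Hypothetical rows standing in for the manifest entries table.
rows = [
    {"snapshot_id": 1, "status": 1},
    {"snapshot_id": 2, "status": 1},
    {"snapshot_id": 2, "status": 2},
]
print(statuses_by_snapshot(rows))
```

One possible reading, which the truncated message does not confirm: a snapshot showing only ADDED entries looks like a plain append, while one showing both ADDED and DELETED entries rewrote or replaced files and would need a closer look.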

Re: Change Data Capture for Iceberg

2022-03-07 Thread Anton Okolnychyi
Hey folks, Based on Yufei’s design doc and what we discussed during the sync, I shared my thoughts on what can be efficiently supported right now. https://github.com/apache/iceberg/issues/3941#issuecomment-1061153554 I’d