RE: [Proposal] Partition stats in Iceberg

2023-05-16 Thread Mayur Srivastava
ue is. On Mon, May 15, 2023 at 7:55 AM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote: Thanks Ryan. For most partition stats, I’m ok with compaction and keeping fewer snapshots. My concern was for supporting last modified time. I guess, if we need to keep all snapshots to s

RE: [Proposal] Partition stats in Iceberg

2023-05-15 Thread Mayur Srivastava
Tue, May 2, 2023 at 4:52 PM Mayur Srivastava <mayur.p.srivast...@gmail.com> wrote: Thanks for the response. One of the use cases that we have is where one business day of data is added at a time to a DAY partitioned table. With 25 years of this data, there will be ~6250 partitions

Re: [Proposal] Partition stats in Iceberg

2023-05-02 Thread Mayur Srivastava
>>>>> we can consider a new field called "last modified time" to be included for the partition stats (or have a pluggable way to allow users to configure the partition stats they need). My use case is to find out if a

Re: [Proposal] Partition stats in Iceberg

2023-05-02 Thread Mayur Srivastava
>>> partition is changed or not given two snapshots (old and new) with a quick and lightweight process. The community previously suggested that I use the change log (CDC), but I think that is too heavy (I guess, since it requires to run

RE: [Proposal] Partition stats in Iceberg

2023-02-07 Thread Mayur Srivastava
like the latest sequence number or last modified time per partition. I will be opening up the discussion about phase 2 schema again once phase 1 implementation is done. Thanks, Ajantha On Tue, Feb 7, 2023 at 8:15 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote: +1 for the i

RE: [Proposal] Partition stats in Iceberg

2023-02-07 Thread Mayur Srivastava
+1 for the initiative. We’ve been exploring options for storing last-modified-time per partition. It is an important building block for data pipelines – especially if there is a dependency between jobs with strong consistency requirements. Is partition stats a good place for storing last-modified-

RE: Getting last modified timestamp/other stats per partition

2022-03-08 Thread Mayur Srivastava
n get in the way. Tagging will reduce the problem, and moving to change-based commits with the REST catalog should also help in the long term. Ryan On Mon, Mar 7, 2022 at 8:18 AM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote: A few follow-up questions for getting last

RE: Getting last modified timestamp/other stats per partition

2022-03-07 Thread Mayur Srivastava
data was added vs snapshots where compaction was done? Thanks, Mayur From: Mayur Srivastava Sent: Thursday, February 24, 2022 7:27 AM To: dev@iceberg.apache.org Subject: RE: Getting last modified timestamp/other stats per partition Thanks Szehon. I’ll give this a try. From: Szehon Ho

RE: Getting last modified timestamp/other stats per partition

2022-02-24 Thread Mayur Srivastava
.data_file.partition Hope that helps, Szehon On Wed, Feb 23, 2022 at 8:50 AM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote: Hi, In Iceberg, is there a way to get the last modified timestamp and other stats (e.g. num rows, uncompressed size, compressed size) of the data per partition? Thanks, Mayur

Getting last modified timestamp/other stats per partition

2022-02-23 Thread Mayur Srivastava
Hi, In Iceberg, is there a way to get the last modified timestamp and other stats (e.g. num rows, uncompressed size, compressed size) of the data per partition? Thanks, Mayur
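
One way to picture the question: per-partition stats are an aggregation over the table's data file entries. Below is a minimal Python sketch of that aggregation, assuming each entry carries a partition value, a record count, a file size, and the commit time of the snapshot that added the file. The tuple layout, the field names, and the sample values are hypothetical and are not Iceberg's metadata schema.

```python
from collections import defaultdict

# Hypothetical rows: (partition, record_count, file_size_bytes, committed_at_ms).
# These stand in for data file entries; they are not read from a real table.
entries = [
    ("2022-02-21", 1000, 4096, 1645450000000),
    ("2022-02-21", 500, 2048, 1645460000000),
    ("2022-02-22", 750, 3072, 1645540000000),
]


def partition_stats(rows):
    """Sum rows and bytes per partition and keep the latest commit time."""
    stats = defaultdict(
        lambda: {"num_rows": 0, "size_bytes": 0, "last_modified_ms": 0}
    )
    for partition, record_count, size_bytes, committed_at_ms in rows:
        s = stats[partition]
        s["num_rows"] += record_count
        s["size_bytes"] += size_bytes
        s["last_modified_ms"] = max(s["last_modified_ms"], committed_at_ms)
    return dict(stats)


stats = partition_stats(entries)
```

Note that with this definition a rewrite of existing files (for example compaction) would also move "last modified" forward unless such commits are filtered out first.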

RE: Single multi-process commit

2021-12-03 Thread Mayur Srivastava
I missing anything? Thanks, Jack Ye On Fri, Dec 3, 2021 at 12:59 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote: Hi, Let’s say there are N (e.g. 32) distributed processes writing to different (non-overlapping) partitions in the same Iceberg table in parallel. When al

Single multi-process commit

2021-12-03 Thread Mayur Srivastava
Hi, Let's say there are N (e.g. 32) distributed processes writing to different (non-overlapping) partitions in the same Iceberg table in parallel. When all of them finish writing, is there a way to do a single commit (by a coordinator process) at the end so that either all or none is committed?
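
The all-or-nothing behaviour asked about here can be modelled as: writers only produce files and report them, and a coordinator appends everything in one step once every writer has succeeded. The following is a toy Python model of that idea only; `coordinator_commit`, the `(ok, files)` result shape, and the list standing in for the table are all hypothetical and do not use any Iceberg API.

```python
def coordinator_commit(worker_results, table_files):
    """Append every worker's files in one step, or none of them.

    worker_results: list of (ok, files) tuples, one per writer process.
    table_files: list standing in for the table's committed file list.
    Returns True if the single commit happened, False otherwise.
    """
    if not all(ok for ok, _ in worker_results):
        return False  # at least one writer failed, so commit nothing
    staged = [f for _, files in worker_results for f in files]
    table_files.extend(staged)  # the single "commit"
    return True


# Usage: two writers succeed, so both sets of files land together.
table = []
committed = coordinator_commit(
    [(True, ["p1/a.parquet"]), (True, ["p2/b.parquet"])], table
)
```

The point of the design is that no writer touches the table state; only the coordinator does, and only once.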

Handling pandas.Timestamps in nanos

2021-12-03 Thread Mayur Srivastava
Hi, Is there a best practice for handling pandas.Timestamps (or numpy.datetime64) in nanos in Iceberg? How are Python users working with timestamps in nanosecond precision, especially if the timestamp is a part of the PartitionSpec? Thanks, Mayur
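
As an illustration of what is at stake, here is a minimal Python sketch that splits a nanosecond epoch value into whole microseconds plus the leftover nanoseconds. It assumes the target timestamp column stores microsecond precision; the helper name `nanos_to_micros` and the idea of keeping the remainder in a separate column are assumptions for illustration, not a recommendation from the thread.

```python
NANOS_PER_MICRO = 1_000


def nanos_to_micros(epoch_nanos):
    """Split epoch nanoseconds into (epoch microseconds, leftover nanoseconds).

    divmod floors, so the leftover is always in [0, 999], also for
    timestamps before the epoch.
    """
    return divmod(epoch_nanos, NANOS_PER_MICRO)


micros, leftover = nanos_to_micros(1_638_550_000_123_456_789)
# micros == 1_638_550_000_123_456, leftover == 789
```

Dropping `leftover` truncates the value; storing it alongside the microsecond timestamp keeps the original value recoverable as `micros * 1_000 + leftover`.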

RE: Schedule for 0.13 release

2021-12-03 Thread Mayur Srivastava
currently expect to cut the release candidate next week. Best, Jack Ye On Fri, Dec 3, 2021 at 12:43 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote: Hi, What is the expected schedule for the 0.13 release? Thanks, Mayur

Schedule for 0.13 release

2021-12-03 Thread Mayur Srivastava
Hi, What is the expected schedule for the 0.13 release? Thanks, Mayur

RE: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-03 Thread Mayur Srivastava
On Thu, Dec 2, 2021 at 1:18 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote: Looks like Jack is already on top of the problem (https://github.com/apache/iceberg/pull/3656). Thanks Jack! From: Mayur Srivastava <mayur.srivast...@twosigma.com> Sent: Thursday,

RE: High memory usage with highly concurrent committers

2021-12-03 Thread Mayur Srivastava
memory utilization in cases when you write many files at once you just need to reduce upload chunk size to 8MiB, for example: fs.gs.outputstream.upload.chunk.size=8388608 On Wed, Dec 1, 2021 at 3:20 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote: That is correct Daniel.
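
For reference, the value 8388608 is 8 MiB expressed in bytes. The short Python sketch below reproduces the setting and gives a rough estimate of buffered memory, assuming each open output stream holds about one chunk in memory and assuming 32 concurrent streams; both assumptions are illustrative and do not come from the thread.

```python
MIB = 1024 * 1024

# 8 MiB in bytes, matching the value quoted in the message above.
chunk_size_bytes = 8 * MIB
setting = f"fs.gs.outputstream.upload.chunk.size={chunk_size_bytes}"
print(setting)

# Rough estimate only: assumes one buffered chunk per open stream
# and a hypothetical count of 32 concurrent streams.
open_streams = 32
buffered_bytes = open_streams * chunk_size_bytes
print(f"approx. buffered: {buffered_bytes // MIB} MiB")
```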

RE: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-02 Thread Mayur Srivastava
Looks like Jack is already on top of the problem (https://github.com/apache/iceberg/pull/3656). Thanks Jack! From: Mayur Srivastava Sent: Thursday, December 2, 2021 4:16 PM To: dev@iceberg.apache.org Subject: RE: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage There are three

RE: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-02 Thread Mayur Srivastava
ate different unique features of different storage providers when the specialized feature request comes, and at that time there is no difference from the dedicated FileIO + ResolvingFileIO architecture. I wonder what Daniel thinks about this since I believe he is more interested in multi-cl

RE: High memory usage with highly concurrent committers

2021-12-01 Thread Mayur Srivastava
dded the initial implementation: https://github.com/apache/iceberg/pull/3593 -Jack On Tue, Nov 30, 2021 at 6:41 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote: Thanks Ryan. I’m looking at the heapdump. At a preliminary look in jvisualvm, I see the following top two objects: 1.

RE: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-01 Thread Mayur Srivastava
really an interesting development. -Dan On Wed, Dec 1, 2021 at 8:12 AM Piotr Findeisen <pi...@starburstdata.com> wrote: if S3FileIO is supposed to be used with other file systems, we should consider proper class renames. just my 2c On Wed, Dec 1, 2021 at 5:07 PM Mayur Srivastava mailto:ma

RE: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-01 Thread Mayur Srivastava
>> wrote: Sounds reasonable to me if they are compatible On Wed, Dec 1, 2021 at 8:27 AM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote: Hi, We have URIs starting with gs:// representing objects on GCS. Currently, S3URI doesn’t support gs:// prefix (see https://githu

RE: Iceberg event notification support

2021-12-01 Thread Mayur Srivastava
+1 to the idea. The events are very useful for building asynchronous services around Iceberg such as observability, garbage collection, compaction, asynchronous table deletion (to handle slow purge calls in the background), etc. It seems like the Iceberg catalog is a good place to configure/se

Supporting gs:// prefix in S3URI for Google Cloud S3 Storage

2021-12-01 Thread Mayur Srivastava
Hi, We have URIs starting with gs:// representing objects on GCS. Currently, S3URI doesn't support gs:// prefix (see https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/s3/S3URI.java#L41). Is there an existing JIRA for supporting this? Any objections to add "g

RE: High memory usage with highly concurrent committers

2021-11-30 Thread Mayur Srivastava
you here, although a Snapshot instance will cache manifests after loading them if they are accessed, so you'd want to watch out for that as well. The best step forward is to get an idea of what objects are taking up that space with a profiler or heap dump if you can. Ryan On Tu

High memory usage with highly concurrent committers

2021-11-30 Thread Mayur Srivastava
Hi Iceberg Community, I'm running some experiments with high commit contention (on the same Iceberg table writing to different partitions) and I'm observing very high memory usage (5G to 7G). (Note that the data being written is very small.) The scenario is described below: Note1: The catalog

RE: Read schema to use during time travel

2021-11-22 Thread Mayur Srivastava
We have some use cases where data and source code are linked. In such cases, time travel to a snapshot with the same snapshot’s schema is desirable because the source code expects the schema corresponding to the snapshot. The current schema may be useful in some cases but it has correctness issu

Re: Welcome new PMC members!

2021-11-17 Thread Mayur Srivastava
Congratulations Jack and Russell! On Thu, Nov 18, 2021 at 1:16 AM Peter Vary wrote: > Congratulations Jack and Russell! > > On Thu, 18 Nov 2021, 05:59 Gidon Gershinsky, wrote: > >> Congratulations guys!! >> >> Cheers, Gidon >> >> >> On Thu, Nov 18, 2021 at 2:12 AM Ryan Blue wrote: >> >>> Hi ev

Study groups or books on Iceberg

2021-10-25 Thread Mayur Srivastava
Hi, I've been using Apache Iceberg for some time and most of the knowledge was built by reading docs and asking questions to the community. I'm wondering if there have been study groups or books on Apache Iceberg apart from the website documentation, Github and blogs. I'm looking for a study gr

RE: In-memory implementation of FileIO

2021-10-14 Thread Mayur Srivastava
nk so. There's one that wraps the local file system we use for testing that at least doesn't depend on Hadoop though. If you want to build an in-memory one that would be great. On Mon, Sep 27, 2021 at 7:32 AM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote: Hi,

RE: Error when writing large number of rows with S3FileIO

2021-10-07 Thread Mayur Srivastava
rows with S3FileIO Hi Mayur, sorry I did not follow up on this, were you able to fix the issue with the AWS SDK upgrade? -Jack Ye On Thu, Sep 23, 2021 at 1:13 PM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote: I’ll try to upgrade the version and retry. Thanks, Mayu

In-memory implementation of FileIO

2021-09-27 Thread Mayur Srivastava
Hi, Is there an in-memory implementation of the FileIO interface? I'm looking for one for writing unit tests (basically avoiding touching local file system or other external resources). Thanks, Mayur

RE: Error when writing large number of rows with S3FileIO

2021-09-23 Thread Mayur Srivastava
reason to use that version specifically? Have you tried a newer version? I know there have been quite a few updates to the S3 package related to uploading since then, maybe upgrading can solve the problem. -Jack On Thu, Sep 23, 2021 at 11:02 AM Mayur Srivastava mailto:mayur.srivast

RE: Error when writing large number of rows with S3FileIO

2021-09-23 Thread Mayur Srivastava
issue, could you report what version of AWS SDK V2 you are using? Best, Jack Ye On Thu, Sep 23, 2021 at 8:39 AM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote: Hi, I've an Iceberg table partitioned by a single "time" (monthly partitioned) column that has 4

Error when writing large number of rows with S3FileIO

2021-09-23 Thread Mayur Srivastava
Hi, I've an Iceberg table partitioned by a single "time" (monthly partitioned) column that has 400+ columns and >100k rows. I'm using parquet files and PartitionedWriter + S3FileIO to write the data. When I write <~50k rows, the writer works. But it fails with the exception below if I write mor

RE: Next community sync

2021-05-24 Thread Mayur Srivastava
Hi Ryan, Could you please add me to the community sync? (sorry, if this info is already public). Thanks, Mayur From: Ryan Blue Sent: Monday, May 24, 2021 7:57 PM To: dev@iceberg.apache.org Subject: Next community sync Hi everyone, When I was out on paternity leave, I let the community syncs

RE: Iceberg catalog questions

2021-05-11 Thread Mayur Srivastava
021 at 11:46 AM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote: Hi, I’m looking to use/implement a PostgreSQL based Iceberg catalog. I’m wondering if one already exists and also have a few questions. I would really appreciate any help I can get with the questions. 1.

Iceberg catalog questions

2021-05-11 Thread Mayur Srivastava
Hi, I'm looking to use/implement a PostgreSQL based Iceberg catalog. I'm wondering if one already exists and also have a few questions. I would really appreciate any help I can get with the questions. 1. Does Iceberg have a catalog that is compatible with PostgreSQL (or any storage backen

RE: Queries regarding fetching of old iceberg schema and how to use iceberg-arrow apis

2021-05-10 Thread Mayur Srivastava
Hi Ayush, The iceberg-arrow changes that Ryan mentioned (https://github.com/apache/iceberg/pull/2286) were merged recently, but they are not feature complete and require a bit more work. Maybe you could contribute to make it better! Hope this helps. Thanks, Mayur From: Ryan Blue Sent: Sunday, Ma

RE: Single Reader Benchmarks on S3-like Storage

2021-03-23 Thread Mayur Srivastava
the file system or other tuning options that might be available before pulling the entire file into memory (though that does provide an interesting comparison point). Just my thoughts on this. Let me know if any of that is unclear, -Dan On Tue, Mar 23, 2021 at 1:44 PM Mayur Srivast

Single Reader Benchmarks on S3-like Storage

2021-03-23 Thread Mayur Srivastava
Hi, I've been running performance benchmarks on core Iceberg readers on Google Cloud Storage (GCS). I would like to share some of my results and check whether there are ways to improve performance on S3-like storage in general. The details (including sample code) are listed below the question

RE: Reading data from Iceberg table into Apache Arrow in Java

2021-03-03 Thread Mayur Srivastava
>> Should we proceed with this pr and later add support for vectorized reads in a separate pr? I meant support deletes in the vectorized reader. Thanks, Mayur From: Mayur Srivastava Sent: Wednesday, March 3, 2021 6:41 AM To: dev@iceberg.apache.org Cc: Ryan Blue Subject: RE:

RE: Reading data from Iceberg table into Apache Arrow in Java

2021-03-03 Thread Mayur Srivastava
eteFiles && (allOrcFileScanTasks || (allParquetFileScanTasks && atLeastOneColumn && onlyPrimitives)); Mayur Srivastava <mayur.srivast...@twosigma.com> wrote (on Tue, Mar 2, 2021 at 15:48): Hi Peter, Good point. Most of the ArrowReader code is inspired from the S

RE: Reading data from Iceberg table into Apache Arrow in Java

2021-03-02 Thread Mayur Srivastava
Peter On Mar 1, 2021, at 16:17, Mayur Srivastava <mayur.srivast...@twosigma.com> wrote: Hi Ryan, I’ve submitted a pr (https://github.com/apache/iceberg/pull/2286) for the vectorized arrow reader: This is my first Iceberg pull request - I'm not fully aware of the contributin

RE: Reading data from Iceberg table into Apache Arrow in Java

2021-03-01 Thread Mayur Srivastava
s not supported: TimeType, ListType, MapType, StructType. What is the path to add Arrow support for these data types? Thanks, Mayur From: Mayur Srivastava Sent: Friday, February 12, 2021 7:41 PM To: dev@iceberg.apache.org; rb...@netflix.com Subject: RE: Reading data from Iceberg table into Apach

RE: Reading data from Iceberg table into Apache Arrow in Java

2021-02-12 Thread Mayur Srivastava
if you have code to convert generics to Arrow, that's really useful to post somewhere.) I hope that helps. It would be great to work with you to improve this in a couple of PRs! rb On Thu, Feb 11, 2021 at 7:22 AM Mayur Srivastava <mayur.srivast...@twosigma.com> wrote: Hi,

Reading data from Iceberg table into Apache Arrow in Java

2021-02-11 Thread Mayur Srivastava
Hi, We have an existing time series data access service based on Arrow/Flight which uses Apache Arrow format data to perform writes and reads (using time range queries) from a bespoke table-backend based on S3-compatible storage. We are trying to replace our bespoke table-backend with Icebe