ue is.
On Mon, May 15, 2023 at 7:55 AM Mayur Srivastava
<mayur.srivast...@twosigma.com> wrote:
Thanks Ryan.
For most partition stats, I’m ok with compaction and keeping fewer snapshots.
My concern was for supporting last modified time. I guess, if we need to keep
all snapshots to s
On Tue, May 2, 2023 at 4:52 PM Mayur Srivastava
<mayur.p.srivast...@gmail.com> wrote:
Thanks for the response.
One of the use cases that we have is where one business day of data is added at
a time to a DAY-partitioned table. With 25 years of this data, there will be
~6250 partitions (roughly 250 business days/year over 25 years).
> we can consider a new field called "last modified time" to be included in
> the partition stats (or have a pluggable way to allow users to
> configure the partition stats they need). My use case is to find out
> whether a partition has changed between two snapshots (old and new) in a
> quick and lightweight way. The community previously suggested that I use
> the change log (CDC), but I think that is too heavy (I guess, since it
> requires running
like the latest sequence number or last modified time per
partition.
I will be opening up the discussion about phase 2 schema again once phase 1
implementation is done.
Thanks,
Ajantha
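A sketch (not from the thread) of the workaround this implies today:
approximating last-modified time per partition by joining the entries and
snapshots metadata tables in Spark. The table name catalog.db.tbl is a
placeholder, and this only sees snapshots that have not been expired, which is
exactly the retention concern raised above:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionLastModified {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // status = 1 marks entries ADDED in a snapshot; joining to snapshots
    // gives the commit time of the last change per partition value.
    Dataset<Row> lastModified = spark.sql(
        "SELECT e.data_file.`partition` AS part, "
            + "MAX(s.committed_at) AS last_modified "
            + "FROM catalog.db.tbl.entries e "
            + "JOIN catalog.db.tbl.snapshots s ON e.snapshot_id = s.snapshot_id "
            + "WHERE e.status = 1 "
            + "GROUP BY e.data_file.`partition`");

    lastModified.show(false);
  }
}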
On Tue, Feb 7, 2023 at 8:15 PM Mayur Srivastava
<mayur.srivast...@twosigma.com> wrote:
+1 for the initiative.
We’ve been exploring options for storing last-modified time per partition. It
is an important building block for data pipelines – especially if there is a
dependency between jobs with strong consistency requirements.
Is partition stats a good place for storing last-modified time?
n get in the way.
Tagging will reduce the problem, and moving to change-based commits with the
REST catalog should also help in the long term.
Ryan
On Mon, Mar 7, 2022 at 8:18 AM Mayur Srivastava
<mayur.srivast...@twosigma.com> wrote:
A few follow-up questions for getting last
data was added vs
snapshots where compaction was done?
Thanks,
Mayur
From: Mayur Srivastava
Sent: Thursday, February 24, 2022 7:27 AM
To: dev@iceberg.apache.org
Subject: RE: Getting last modified timestamp/other stats per partition
Thanks Szehon. I’ll give this a try.
From: Szehon Ho
.data_file.partition
Hope that helps,
Szehon
On Wed, Feb 23, 2022 at 8:50 AM Mayur Srivastava
<mayur.srivast...@twosigma.com> wrote:
Hi,
In Iceberg, is there a way to get the last modified timestamp and other stats
(e.g. num rows, uncompressed size, compressed size) of the data per partition?
Thanks,
Mayur
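One way to answer this today, sketched against the files metadata table
(catalog.db.tbl is a placeholder; file_size_in_bytes is the compressed on-disk
size, and a single per-file uncompressed size is not tracked):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionStats {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();

    // One row per data file; aggregate up to one row per partition.
    Dataset<Row> stats = spark.sql(
        "SELECT `partition`, "
            + "COUNT(*) AS num_files, "
            + "SUM(record_count) AS num_rows, "
            + "SUM(file_size_in_bytes) AS total_bytes "
            + "FROM catalog.db.tbl.files "
            + "GROUP BY `partition`");

    stats.show(false);
  }
}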
Am I missing anything?
Thanks,
Jack Ye
On Fri, Dec 3, 2021 at 12:59 PM Mayur Srivastava
<mayur.srivast...@twosigma.com> wrote:
Hi,
Let’s say there are N (e.g. 32) distributed processes writing to different
(non-overlapping) partitions in the same Iceberg table in parallel.
When all of them finish writing, is there a way to do a single commit (by a
coordinator process) at the end so that either all or none is committed?
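A sketch of one way to do this with the core Java API, assuming the N writers
can hand their completed DataFile metadata to a coordinator process; a single
AppendFiles commit produces one atomic snapshot, so either all files become
visible or none do:

import java.util.List;
import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;

public class CoordinatorCommit {
  // Writers create data files without committing and send the resulting
  // DataFile metadata to the coordinator, which commits once.
  public static void commitAll(Table table, List<DataFile> filesFromWriters) {
    AppendFiles append = table.newAppend();
    filesFromWriters.forEach(append::appendFile);
    append.commit(); // one atomic snapshot covering all partitions
  }
}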
Hi,
Is there a best practice for handling the pandas.Timestamps (or
numpy.datetime64) in nanos in Iceberg? How are the Python users working with
the timestamps in nanos precision, especially if it is part of the PartitionSpec?
Thanks,
Mayur
We currently expect to cut the release candidate next week.
Best,
Jack Ye
On Fri, Dec 3, 2021 at 12:43 PM Mayur Srivastava
<mayur.srivast...@twosigma.com> wrote:
Hi,
What is the expected schedule for the 0.13 release?
Thanks,
Mayur
memory utilization in cases when you write many files at once; you just need
to reduce the upload chunk size to 8 MiB, for example:
fs.gs.outputstream.upload.chunk.size=8388608
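For example, applied to the Hadoop configuration that the GCS connector reads
(the property name is the one quoted above; the wrapper class is illustrative):

import org.apache.hadoop.conf.Configuration;

public class GcsUploadConfig {
  public static Configuration lowMemoryConf() {
    Configuration conf = new Configuration();
    // The connector buffers one upload chunk per open output stream, so a
    // smaller chunk bounds memory when many files are written at once.
    conf.set("fs.gs.outputstream.upload.chunk.size", "8388608"); // 8 MiB
    return conf;
  }
}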
On Wed, Dec 1, 2021 at 3:20 PM Mayur Srivastava
<mayur.srivast...@twosigma.com> wrote:
That is correct, Daniel.
Looks like Jack is already on top of the problem
(https://github.com/apache/iceberg/pull/3656). Thanks Jack!
From: Mayur Srivastava
Sent: Thursday, December 2, 2021 4:16 PM
To: dev@iceberg.apache.org
Subject: RE: Supporting gs:// prefix in S3URI for Google Cloud S3 Storage
There are three
ate different unique features of different storage providers when specialized
feature requests come, and at that point there is no difference from the
dedicated FileIO + ResolvingFileIO architecture.
I wonder what Daniel thinks about this, since I believe he is more interested
in multi-cl
Added the initial implementation:
https://github.com/apache/iceberg/pull/3593
-Jack
On Tue, Nov 30, 2021 at 6:41 PM Mayur Srivastava
<mayur.srivast...@twosigma.com> wrote:
Thanks Ryan.
I’m looking at the heap dump. From a preliminary look in jvisualvm, I see the
following top two objects:
1.
really an interesting development.
-Dan
On Wed, Dec 1, 2021 at 8:12 AM Piotr Findeisen
<pi...@starburstdata.com> wrote:
If S3FileIO is supposed to be used with other file systems, we should consider
proper class renames.
Just my 2c.
On Wed, Dec 1, 2021 at 5:07 PM Mayur Srivastava
<ma...> wrote:
Sounds reasonable to me if they are compatible.
+1 to the idea.
The events are very useful for building asynchronous services around Iceberg
such as observability, garbage collection, compaction, asynchronous table
deletion (to handle slow purge calls in the background), etc.
It seems like the Iceberg catalog is a good place to configure/se
Hi,
We have URIs starting with gs:// representing objects on GCS. Currently, S3URI
doesn't support gs:// prefix (see
https://github.com/apache/iceberg/blob/master/aws/src/main/java/org/apache/iceberg/aws/s3/S3URI.java#L41).
Is there an existing JIRA for supporting this? Any objections to adding "g
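For reference, later Iceberg releases solved this by routing each URI scheme
to its own FileIO instead of extending S3URI. A sketch of the catalog property
involved (ResolvingFileIO and GCSFileIO did not exist at the time of this
thread):

import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.CatalogProperties;

public class FileIoProps {
  public static Map<String, String> catalogProps() {
    Map<String, String> props = new HashMap<>();
    // ResolvingFileIO picks an implementation per scheme at runtime:
    // s3:// -> S3FileIO, gs:// -> GCSFileIO, everything else -> HadoopFileIO.
    props.put(CatalogProperties.FILE_IO_IMPL,
        "org.apache.iceberg.io.ResolvingFileIO");
    return props;
  }
}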
you here, although a Snapshot instance will cache
manifests after loading them if they are accessed, so you'd want to watch out
for that as well.
The best step forward is to get an idea of what objects are taking up that
space with a profiler or heap dump if you can.
Ryan
On Tu
Hi Iceberg Community,
I'm running some experiments with high commit contention (on the same Iceberg
table writing to different partitions) and I'm observing very high memory usage
(5G to 7G). (Note that the data being written is very small.)
The scenario is described below:
Note1: The catalog
We have some use cases where data and source code are linked. In such cases,
time travel that reads a snapshot using that snapshot’s own schema is desirable,
because the source code expects the schema corresponding to the snapshot. The
current schema may be useful in some cases, but it has correctness issues
Congratulations Jack and Russell!
On Thu, Nov 18, 2021 at 1:16 AM Peter Vary
wrote:
> Congratulations Jack and Russell!
>
> On Thu, 18 Nov 2021, 05:59 Gidon Gershinsky wrote:
>
>> Congratulations guys!!
>>
>> Cheers, Gidon
>>
>>
>> On Thu, Nov 18, 2021 at 2:12 AM Ryan Blue wrote:
>>
>>> Hi ev
Hi,
I've been using Apache Iceberg for some time and most of the knowledge was
built by reading docs and asking questions to the community.
I'm wondering if there have been study groups or books on Apache Iceberg apart
from the website documentation, Github and blogs. I'm looking for a study gr
I don't think so. There's one that wraps the local file system we use for
testing that at least doesn't depend on Hadoop, though. If you want to build
an in-memory one, that would be great.
On Mon, Sep 27, 2021 at 7:32 AM Mayur Srivastava
<mayur.srivast...@twosigma.com> wrote:
Hi,
rows with S3FileIO
Hi Mayur, sorry I did not follow up on this, were you able to fix the issue
with the AWS SDK upgrade?
-Jack Ye
On Thu, Sep 23, 2021 at 1:13 PM Mayur Srivastava
<mayur.srivast...@twosigma.com> wrote:
I’ll try to upgrade the version and retry.
Thanks,
Mayur
Hi,
Is there an in-memory implementation of the FileIO interface?
I'm looking for one for writing unit tests (basically avoiding touching local
file system or other external resources).
Thanks,
Mayur
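A minimal sketch of what such an in-memory FileIO could look like
(illustrative only, not an official implementation; every file lives in a map
as a byte array, so it is only suitable for small unit-test data):

import java.io.ByteArrayOutputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.iceberg.exceptions.AlreadyExistsException;
import org.apache.iceberg.exceptions.NotFoundException;
import org.apache.iceberg.io.FileIO;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.io.PositionOutputStream;
import org.apache.iceberg.io.SeekableInputStream;

public class InMemoryFileIO implements FileIO {
  private final Map<String, byte[]> files = new ConcurrentHashMap<>();

  @Override
  public InputFile newInputFile(String path) { return new InMemoryInputFile(path); }

  @Override
  public OutputFile newOutputFile(String path) { return new InMemoryOutputFile(path); }

  @Override
  public void deleteFile(String path) { files.remove(path); }

  private class InMemoryInputFile implements InputFile {
    private final String path;
    InMemoryInputFile(String path) { this.path = path; }

    @Override public long getLength() { return contents().length; }
    @Override public String location() { return path; }
    @Override public boolean exists() { return files.containsKey(path); }

    @Override public SeekableInputStream newStream() {
      byte[] data = contents();
      return new SeekableInputStream() {
        private int pos = 0;
        @Override public long getPos() { return pos; }
        @Override public void seek(long newPos) { pos = (int) newPos; } // test-only sizes
        @Override public int read() { return pos < data.length ? (data[pos++] & 0xFF) : -1; }
      };
    }

    private byte[] contents() {
      byte[] data = files.get(path);
      if (data == null) {
        throw new NotFoundException("No in-memory file: %s", path);
      }
      return data;
    }
  }

  private class InMemoryOutputFile implements OutputFile {
    private final String path;
    InMemoryOutputFile(String path) { this.path = path; }

    @Override public String location() { return path; }
    @Override public InputFile toInputFile() { return new InMemoryInputFile(path); }

    @Override public PositionOutputStream create() {
      if (files.containsKey(path)) {
        throw new AlreadyExistsException("Already exists: %s", path);
      }
      return createOrOverwrite();
    }

    @Override public PositionOutputStream createOrOverwrite() {
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      return new PositionOutputStream() {
        @Override public long getPos() { return buffer.size(); }
        @Override public void write(int b) { buffer.write(b); }
        @Override public void close() { files.put(path, buffer.toByteArray()); } // visible on close
      };
    }
  }
}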
Is there a reason to use that version specifically? Have you tried a newer
version? I know there have been quite a few updates to the S3 package related
to uploading since then; maybe upgrading can solve the problem.
-Jack
On Thu, Sep 23, 2021 at 11:02 AM Mayur Srivastava
<mayur.srivast...> wrote:
issue, could you report what version of AWS SDK V2
you are using?
Best,
Jack Ye
On Thu, Sep 23, 2021 at 8:39 AM Mayur Srivastava
<mayur.srivast...@twosigma.com> wrote:
Hi,
I've an Iceberg table partitioned by a single "time" (monthly partitioned)
column that has 400+ columns and >100k rows. I'm using parquet files and
PartitionedWriter + S3FileIO to write the data. When I write <~50k
rows, the writer works. But it fails with the exception below if I write more
Hi Ryan,
Could you please add me to the community sync? (Sorry if this info is already
public.)
Thanks,
Mayur
From: Ryan Blue
Sent: Monday, May 24, 2021 7:57 PM
To: dev@iceberg.apache.org
Subject: Next community sync
Hi everyone,
When I was out on paternity leave, I let the community syncs
2021 at 11:46 AM Mayur Srivastava
<mayur.srivast...@twosigma.com> wrote:
Hi,
I'm looking to use/implement a PostgreSQL-based Iceberg catalog. I'm wondering
if one already exists and also have a few questions. I would really appreciate
any help I can get with the questions.
1. Does Iceberg have a catalog that is compatible with PostgreSQL (or any
storage backen
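Iceberg's JdbcCatalog, which landed around the time of this thread, works
against any JDBC backend, including PostgreSQL. A hedged configuration sketch;
the connection details and warehouse path are placeholders:

import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.CatalogProperties;
import org.apache.iceberg.jdbc.JdbcCatalog;

public class PostgresCatalog {
  public static JdbcCatalog connect() {
    Map<String, String> props = new HashMap<>();
    props.put(CatalogProperties.URI, "jdbc:postgresql://localhost:5432/iceberg");
    props.put(JdbcCatalog.PROPERTY_PREFIX + "user", "iceberg");    // "jdbc.user"
    props.put(JdbcCatalog.PROPERTY_PREFIX + "password", "secret"); // "jdbc.password"
    props.put(CatalogProperties.WAREHOUSE_LOCATION, "s3://bucket/warehouse");
    props.put(CatalogProperties.FILE_IO_IMPL, "org.apache.iceberg.aws.s3.S3FileIO");

    JdbcCatalog catalog = new JdbcCatalog();
    catalog.initialize("pg", props); // creates its bookkeeping tables if missing
    return catalog;
  }
}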
Hi Ayush,
The iceberg-arrow changes that Ryan mentioned
(https://github.com/apache/iceberg/pull/2286) were merged recently, but they
are not feature complete and require a bit more work. Maybe you could
contribute to make it better! Hope this helps.
Thanks,
Mayur
From: Ryan Blue
Sent: Sunday, Ma
the file
system or other tuning options that might be available before pulling the
entire file into memory (though that does provide an interesting comparison
point).
Just my thoughts on this. Let me know if any of that is unclear,
-Dan
On Tue, Mar 23, 2021 at 1:44 PM Mayur Srivast
Hi,
I've been running performance benchmarks on core Iceberg readers on Google
Cloud Storage (GCS). I would like to share some of my results and check whether
there are ways to improve performance on S3-like storage in general. The
details (including sample code) are listed below the question
>> Should we proceed with this pr and later add support for vectorized reads in
>> a separate pr?
I meant support deletes in the vectorized reader.
Thanks,
Mayur
From: Mayur Srivastava
Sent: Wednesday, March 3, 2021 6:41 AM
To: dev@iceberg.apache.org
Cc: Ryan Blue
Subject: RE:
eteFiles &&
    (allOrcFileScanTasks ||
        (allParquetFileScanTasks && atLeastOneColumn && onlyPrimitives));
Mayur Srivastava
<mayur.srivast...@twosigma.com> wrote
(on Tue, Mar 2, 2021 at 15:48):
Hi Peter,
Good point.
Most of the ArrowReader code is inspired by the S
Peter
On Mar 1, 2021, at 16:17, Mayur Srivastava
<mayur.srivast...@twosigma.com> wrote:
Hi Ryan,
I’ve submitted a pr (https://github.com/apache/iceberg/pull/2286) for the
vectorized arrow reader:
This is my first Iceberg pull request - I'm not fully aware of the contributin
s not supported: TimeType, ListType, MapType, StructType. What is
the path to add Arrow support for these data types?
Thanks,
Mayur
From: Mayur Srivastava
Sent: Friday, February 12, 2021 7:41 PM
To: dev@iceberg.apache.org; rb...@netflix.com
Subject: RE: Reading data from Iceberg table into Apach
if you have code to convert
generics to Arrow, that's really useful to post somewhere.)
I hope that helps. It would be great to work with you to improve this in a
couple of PRs!
rb
On Thu, Feb 11, 2021 at 7:22 AM Mayur Srivastava
<mayur.srivast...@twosigma.com> wrote:
Hi,
We have an existing time series data access service based on Arrow/Flight,
which uses Apache Arrow format data to perform writes and reads (using time
range queries) against a bespoke table backend built on S3-compatible storage.
We are trying to replace our bespoke table backend with Iceberg