Re: [Discuss] Geospatial Support

2024-09-30 Thread Szehon Ho
and team are also volunteering to work on the prototype immediately afterwards. Thank you, Szehon On Tue, Aug 20, 2024 at 1:57 PM Szehon Ho wrote: > Hi all > > Please take a look at the proposed spec change to support Geo type for V3 > in : https://github.com/apache/iceberg/pul

Re: [DISCUSS] Action to Rewrite Equality Deletes as Position Deletes

2024-09-13 Thread Szehon Ho
+1, I'd be happy to see this feature. Thanks Szehon On Fri, Sep 13, 2024 at 10:33 AM Prashant Singh wrote: > Hi All, > > Starting this thread to revive the discussion on converting Equality > Deletes as Position deletes and see if this is something community wants > now (Happy to contribute in t

Re: [Discuss] Geospatial Support

2024-08-20 Thread Szehon Ho
). Thanks, Szehon On Wed, Jun 26, 2024 at 7:29 PM Szehon Ho wrote: > Hi > > It was great to meet in person with Snowflake engineers and we had a good > discussion on the paths forward. > > Meeting notes for Snowflake- Iceberg sync. > >- Iceberg proposed Geometry type d

Re: Welcome Péter, Amogh and Eduard to the Apache Iceberg PMC

2024-08-13 Thread Szehon Ho
Congratulations all, very well deserved! Thanks Szehon On Tue, Aug 13, 2024 at 10:25 PM Russell Spitzer wrote: > Hi Y'all, > > It is my pleasure to let everyone know that the Iceberg PMC has voted to > have several talented individuals join us. > > So without further ado, please welcome Péter V

Re: [DISCUSS] adoption of format version 3

2024-08-06 Thread Szehon Ho
10:19 PM Micah Kornfield < >>>>> emkornfi...@gmail.com> wrote: >>>>> >>>>>> It sounds like most of the opinions so far are waiting for the scope >>>>>> of work to finish before finalizing the specification. >>>>>>

Re: [DISCUSS] adoption of format version 3

2024-07-31 Thread Szehon Ho
Sorry I missed the sync this morning (sick), I'd like to push for geo too. I think on this front as per the last sync, Ryan recommended to wait for Parquet support to land, to avoid having two versions on Iceberg side (Iceberg-native vs Parquet-native). Parquet support is being actively worked on

Re: [DISCUSS] Guidelines for committing PRs

2024-07-29 Thread Szehon Ho
t 1:53 PM Szehon Ho wrote: > Hi, > > Also if I read it correctly, I think this proposal imposes the following > workflows in "spec" folders : > >1. Large and functional changes. These redirect to Iceberg >improvement proposals, which ends in code-modi

Re: [DISCUSS] Guidelines for committing PRs

2024-07-29 Thread Szehon Ho
Hi, Also if I read it correctly, I think this proposal imposes the following workflows in "spec" folders : 1. Large and functional changes. These redirect to Iceberg improvement proposals, which ends in code-modification vote 2. bug-fixes or clarification, which is specified to require

Re: [VOTE] Drop Java 8 support in Iceberg 1.7.0

2024-07-26 Thread Szehon Ho
+1 (binding) Thanks Szehon On Fri, Jul 26, 2024 at 8:55 AM Steven Wu wrote: > +1 (binding) > > I would also suggest keeping the vote open for 7 days for a larger > decision like this. > > > On Fri, Jul 26, 2024 at 8:50 AM Ryan Blue > wrote: > >> +1 >> >> On Fri, Jul 26, 2024 at 8:42 AM Russell

Re: Dropping JDK 8 support

2024-07-22 Thread Szehon Ho
+1 for dropping JDK 8 in Iceberg 2.0. I also wonder the same thing as Huaxin (sorry if I missed a previous thread on Iceberg 2.0 plan). Also as Huaxin has discovered in Spark 4.0 Support PR , looks like we may have to drop Java8 first in Spark 4.0 mod

Re: Building with JDK 21

2024-07-22 Thread Szehon Ho
Thanks Piotr for driving this, late +1 to add JDK 21 support and your plan for spotless. It seems ok to me too to bite the bullet and move to newer spotless (disabling spotless for JDK8 builds) post 1.6, but looks like the discussion happened and I'm fine either way. Thanks! Szehon On Mon, Jul 2

Re: [DISCUSS] DROP PARTITION in Spark

2024-07-17 Thread Szehon Ho
Hi Gabor I'm neutral on this, but can be convinced. My initial thought is that there would be no way to have ADD PARTITION (I assume old Hive workloads would rely on this), and these are not ANSI SQL standard statements as Spark moves in that direction. The second point of guaranteeing a metad

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-16 Thread Szehon Ho
lumns). >>> >>> How long are we going to keep the expired snapshot references by >>> default? If it is months/years, it can have major implications on the query >>> performance of metadata tables (like snapshots, all_*). >>> >>> I assume it will also have

Re: [VOTE] spec: remove the JSON spec for content file and file scan task sections

2024-07-11 Thread Szehon Ho
+1 Thanks Szehon On Thu, Jul 11, 2024 at 11:02 AM Daniel Weeks wrote: > +1 (binding) > > On Thu, Jul 11, 2024 at 10:54 AM Anurag Mantripragada > wrote: > >> +1 (non-binding) .Thanks Steve >> >> >> Anurag Mantripragada >> >> On Jul 11, 2024, at 10:27 AM, Yufei Gu wrote: >> >> +1 (binding) Than

Re: allowing configs to be specified in SQLConf for Spark reads/writes

2024-07-09 Thread Szehon Ho
e work on supporting DELETE/UPDATE/MERGE in > the DataFrame API? > Thanks, > Wing Yew > > > On Tue, Jul 9, 2024 at 10:05 PM Szehon Ho wrote: > >> Hi, >> >> Just FYI, good news, this change is merged on the Spark side : >> https://github.com/apache/spark/p

Re: allowing configs to be specified in SQLConf for Spark reads/writes

2024-07-09 Thread Szehon Ho
Hi, Just FYI, good news, this change is merged on the Spark side : https://github.com/apache/spark/pull/46707 (it's the third effort!). In the next version of Spark, we will be able to pass read properties via SQL to a particular Iceberg table such as SELECT * FROM iceberg.db.table1 WITH (`locality`

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-09 Thread Szehon Ho
t even want to know >> of. If one can expire a snapshot from the middle of the history, that would >> be nice, so users would see only S1/S2/S4. The only downside is that >> reading S2 is less performant than reading S3, but IMHO this could be >> acceptable for having onl

Re: [DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-08 Thread Szehon Ho
implementations. Also, the type >>> of metadata tracked can differ depending on the use case. For example, >>> while LakeChime retains partition and operation type metadata, it does not >>> track file-level metadata as there was no specific use case for that. >>

[DISCUSS] Extend Snapshot Metadata Lifecycle

2024-07-05 Thread Szehon Ho
Hi folks, I would like to discuss an idea for an optional extension of Iceberg's Snapshot metadata lifecycle. Thanks Piotr for replying on the other thread that this should be a fuller Iceberg format change. *Proposal Summary* Currently, ExpireSnapshots(long olderThan) purges metadata and delet

Re: [Proposal] REST Spec: Server-side Metadata Tables

2024-07-03 Thread Szehon Ho
file removal without removing all the snapshot > information yet. > Please help my understand the reasoning behind these tradeoffs. > > Best > PF > > > > > On Thu, 4 Jul 2024 at 02:26, Szehon Ho <mailto:szehon.apa...@gmail.com>> wrote: >> Yes, I was ch

Re: [Proposal] REST Spec: Server-side Metadata Tables

2024-07-03 Thread Szehon Ho
Yes, I was chatting with Yufei about this, at first glance I agree this would be nice to have. I always thought that metadata tables are important enough to spec somewhere, and I think this is a nice place to do it. There seems to be some overlap with existing calls (ie, you can get snapshots

Re: [Discuss] Geospatial Support

2024-06-26 Thread Szehon Ho
ored as a string, Iceberg cannot read it. This should be ok, as we only need this for XZ2 transform, where the user already passes in the info from CRS (up to user to make sure these align). Thanks Szehon On Tue, Jun 18, 2024 at 12:23 PM Szehon Ho wrote: > Jia and I will sync with t

Re: Feedback Collection: Bylaws in Iceberg

2024-06-24 Thread Szehon Ho
Hi Also copying my previous response in private. Hi > Thanks Jack for taking the time for this doc. While the Iceberg community > and PMC so far has been one of the most collaborative, and I have > personally the utmost respect for those that laid the groundwork without > which we would not be h

Re: Making the NDV property required for theta sketch blobs in Puffin

2024-06-21 Thread Szehon Ho
It makes sense to me, normally changing optional -> required would probably require a version bump, but maybe it is ok here as it is a relatively new format, afaik adopted by Trino which already sets this field, but let's see if anyone disagrees. Thanks Szehon On Fri, Jun 21, 2024 at 3:35 PM huax

Re: Agenda Community Sync 19th June

2024-06-18 Thread Szehon Ho
Hi guys, The sync is Juneteenth (US federal holiday), so I think some folks on this side may miss, FYI PS (at least from my side) one highlight is the longstanding 1k column bug is finally fixed (at least partially) in https://github.com/apache/iceberg/pull/10020 Thanks Szehon On Tue, Jun 18, 2

Re: [Discuss] Geospatial Support

2024-06-18 Thread Szehon Ho
ote: >> >>> > The min/max stats are discussed in the doc (Phase 2), depending on the >>> non-trivial encoding. >>> >>> Just want to add that min/max stats filtering could be supported by file >>> format natively. Adding geometry type to parquet spec >>

Re: [Discuss] Geospatial Support

2024-06-05 Thread Szehon Ho
not many libs >> can parse projjson. >> >> @Szehon Is there a way that we can support both SRID and PROJJSON in Geo >> Iceberg? >> >> It is also worth noting that, although there are many libs that can parse >> SRID and perform look-up in the EPSG database,

Re: [Discuss] Geospatial Support

2024-05-29 Thread Szehon Ho
two > columns from different data providers. > > To address this we would like to propose including the option to specify > the SRS with only a SRID in phase 1. The query engine may choose to treat > it as opaque identified or make a look-up in the EPSG database of > supported. >

Re: [Discuss] Heap pressure with RewriteFiles APIs

2024-05-21 Thread Szehon Ho
Hi Naveen Yes it sounds like it will help to disable metrics for those columns? Iirc, by default manifest entries have metrics at 'truncate(16)' level for 100 columns, which as you see can be quite memory intensive. A potential improvement later also is to have the ability to remove counts by
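As a rough illustration of the suggestion above (not taken from the thread): Iceberg controls per-column metrics through table properties of the form write.metadata.metrics.column.<name>, and 'none' turns metrics off for that column. The helper below only builds the corresponding Spark SQL statements as strings. The function name, table name, and column names are hypothetical.

```python
def metrics_off_statements(table, columns):
    """Build one ALTER TABLE statement per column, setting its metrics mode to 'none'."""
    statements = []
    for col in columns:
        # Per-column override of the default metrics mode (truncate(16)).
        prop = f"write.metadata.metrics.column.{col}"
        statements.append(
            f"ALTER TABLE {table} SET TBLPROPERTIES ('{prop}'='none')"
        )
    return statements


# Hypothetical wide table with two large columns whose stats are not needed.
statements = metrics_off_statements("db.wide_table", ["payload", "raw_json"])
for stmt in statements:
    print(stmt)
```

The statements would then be run through whatever SQL entry point the engine provides; new metrics modes apply to files written after the change, not to existing manifests.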

Re: Materialized Views: Next Steps

2024-05-10 Thread Szehon Ho
apache/spark/blob/2df494fd4e4e64b9357307fb0c5e8fc1b7491ac3/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/ViewInfo.java#L45 > > Thanks, > Walaa. > > On Thu, May 9, 2024 at 11:30 PM Szehon Ho wrote: > >> Hi Walaa >> >> As there may be confus

Re: Materialized Views: Next Steps

2024-05-09 Thread Szehon Ho
ent/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABK7e3QB4 > [2] > https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY/edit?pli=1&disco=AAABIonvCGE > > Thanks, > Walaa. > > > On Thu, May 9, 2024 at 5:49 PM Szehon Ho wro

Re: Materialized Views: Next Steps

2024-05-09 Thread Szehon Ho
t by now. If we agree, we can continue the > discussion on the PR, else, we can create a doc. > > Thanks, > Walaa. > > > On Thu, May 9, 2024 at 4:39 PM Szehon Ho wrote: > >> Thanks Walaa for driving it forward, looking forward to thinking about >> implementation

Re: Materialized Views: Next Steps

2024-05-09 Thread Szehon Ho
Thanks Walaa for driving it forward, looking forward to thinking about implementation of Materialized Views. I see Jan's point, the PR spec change is similar but does not seem to be completely aligned with the Draft Spec in the design doc: https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6

[Discuss] Geospatial Support

2024-05-01 Thread Szehon Ho
Hi everyone, We have created a formal proposal for adding Geospatial support to Iceberg. Please read the following for details. - Github Proposal : https://github.com/apache/iceberg/issues/10260 - Proposal Doc: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt2

Re: [Proposal] Add support for Materialized Views in Iceberg

2024-04-22 Thread Szehon Ho
+1 for the approach given it reduces the work. On this, as it exposes storage tables to user catalog, I was mainly thinking we should have a common suffix/naming pattern for storage table across catalog. The netflix approach sounds good to me. Hope we can continue the proposal, as there's still

Re: [VOTE] Release Apache Iceberg 1.5.1 RC0

2024-04-22 Thread Szehon Ho
+1 (binding) * Verify signature * Verify checksum * Verify licenses * Build and run basic test with Spark 3.5 Thanks Szehon On Sun, Apr 21, 2024 at 11:45 PM Ajantha Bhat wrote: > +1 (non-binding) > > * validated checksum and signature > * checked license docs & ran RAT checks > * ran build and

Re: Materialized view integration with REST spec

2024-03-22 Thread Szehon Ho
s back? > > On Fri, Mar 22, 2024 at 10:35 AM Szehon Ho > wrote: > >> Hi >> >> My understanding was last time it was still unresolved, and the action >> item was on Jack and/or/ Jan to make a shorter document. I think the >> debate now has boiled down to Ryan&

Re: Materialized view integration with REST spec

2024-03-22 Thread Szehon Ho
n 6: New MV spec with table and view metadata >>>>>>>>>>>>> >>>>>>>>>>>>> I originally excluded option 2 because I think it does not >>>>>>>>>>>>> align

Re: New committer: Renjie Liu

2024-03-11 Thread Szehon Ho
Congratulations! On Mon, Mar 11, 2024 at 12:43 PM Jack Ye wrote: > Congratulations Renjie! > > Best, > Jack Ye > > On Mon, Mar 11, 2024, 8:24 AM Ryan Blue wrote: > >> Congratulations, Renjie! Thanks for all your contributions! >> >> On Mon, Mar 11, 2024 at 12:52 AM Eduard Tudenhoefner >> wrote

Re: [VOTE] Release Apache Iceberg 1.5.0 RC6

2024-03-08 Thread Szehon Ho
+1 (binding) * Verified signature * Verified checksum * RAT check * built JDK 11 * Ran basic tests on Spark 3.5 Thanks Szehon On Fri, Mar 8, 2024 at 5:50 PM Amogh Jahagirdar wrote: > +1 non-binding > > Verified signatures,checksums,RAT checks, build, and tests with JDK11. I > also ran ad-hoc t

Re: New committer: Bryan Keller

2024-03-05 Thread Szehon Ho
Congratulations Bryan, well deserved, great work on Iceberg ! On Tue, Mar 5, 2024 at 8:14 AM Jack Ye wrote: > Congrats Bryan! > > -Jack > > On Tue, Mar 5, 2024 at 7:33 AM Amogh Jahagirdar wrote: > >> Congratulations Bryan! Very well deserved, thank you for all your >> contributions! >> >> On Tu

Re: [VOTE] Release Apache Iceberg 1.5.0 RC4

2024-03-01 Thread Szehon Ho
+1 (binding) - Verified signature - Verified checksum - RAT check - Compiled - Manually ran basic queries on Spark 3.5 On Fri, Mar 1, 2024 at 6:13 AM Fokko Driesprong wrote: > +1 (binding) > > - Checked checksum and signature > - Ran a modified version of dbt-spark to take advantage of the view

Re: Materialized view integration with REST spec

2024-02-29 Thread Szehon Ho
Hi Yes I mostly agree with the assessment. To clarify a few minor points. is a materialized view a view and a separate table, a combination of the > two (i.e. commits are combined), or a new metadata type? For 'new metadata type', I consider mostly Jack's initial proposal of a new Catalog MV o

Re: Materialized view integration with REST spec

2024-02-22 Thread Szehon Ho
o keep these separate from discussions about single points >>>> so that they can be persisted in the document. >>> >>> >>> Not sure if it helpful, but I added voting chips Question 0, as maybe an >>> easier way to keep track of votes. If it is helpful

Re: Materialized view integration with REST spec

2024-02-21 Thread Szehon Ho
f we think >>> this format is not effective, I propose that we create a new mv channel in >>> Iceberg Slack workspace, and people interested can join and discuss all >>> these points directly. What do we think? >>> >>> Best, >>> Jack Ye >

Re: Materialized view integration with REST spec

2024-02-19 Thread Szehon Ho
Hi, Great to see more discussion on the MV spec. Actually, Jan's document "Iceberg Materialized View Spec" has been organized , with a "Design Questions" section to track these debates, and it would be nice to centr

Re: Spec change for multi-arg transform

2024-01-30 Thread Szehon Ho
Sorry I may have misunderstood the statement and maybe this is specific to multi-arg transform, in any case let's get a spec pr earlier in to discuss/specify behavior for V1-2 vs 3. Thanks Szehon On Tue, Jan 30, 2024 at 9:23 AM Szehon Ho wrote: > Thanks all for the discussion. >

Re: Spec change for multi-arg transform

2024-01-30 Thread Szehon Ho
ference is that >>>> for step 2, we typically just build one reference implementation in the >>>> Java library. We do vote on the large spec updates, but in this case you >>>> haven't seen one since we haven't built the reference implementation yet. >>

Re: Spec change for multi-arg transform

2024-01-28 Thread Szehon Ho
Hi, This would not be retrofitting existing partition transforms, but just allowing for the creation of new multi-arg transforms. Is the concern that some implementations are never expecting new transforms to be added? Old implementations would indeed not be able to read Iceberg tables created w

Table owned locations

2023-08-29 Thread Szehon Ho
Hi all, As you know, there is a recurring Iceberg issue where delete orphan file operations may inadvertently delete other table's data, if they are misconfigured to have the same location. A while back, Anton had a proposal for 'owned.locations' in: https://github.com/apache/iceberg/issues/4159

Re: Proposal to fix the docs - this time it'll be different

2023-07-27 Thread Szehon Ho
Hi I'm ok with putting things back in Iceberg repo, it gets more visibility on prs. I guess it used to be a bit distracting, but now with more projects in Iceberg (pyiceberg, rust) we have to anyway use tags to filter through all the mails. Just wanted to +1 on Fokko/Ryan suggestion to avoid vers

[ANNOUNCE] Apache Iceberg release 1.3.1

2023-07-25 Thread Szehon Ho
I'm pleased to announce the release of Apache Iceberg 1.3.1! Apache Iceberg is an open table format for huge analytic datasets. Iceberg delivers high query performance for tables with tens of petabytes of data, along with atomic commits, concurrent writes, and SQL-compatible table evolution. This

[PASSED][VOTE] Release Apache Iceberg 1.3.1 RC1

2023-07-24 Thread Szehon Ho
Szehon On Mon, Jul 24, 2023 at 2:21 PM Szehon Ho wrote: > +1 (binding) > > 1. Verify signatures > 2. Verify checksums > 3. Verify license documentation > 4. Built and ran tests, only failure is TestS3RestSigner > 5. Ran simple queries against Spark 3.4 > > Thanks > S

Re: [VOTE] Release Apache Iceberg 1.3.1 RC1

2023-07-24 Thread Szehon Ho
ionStage.execute(ApiCallAttemptMetricCollectionStage.java:36) >> at >> software.amazon.awssdk.core.internal.http.pipeline.stages.RetryableStage.execute(RetryableStage.java:81) >> ... 23 more >> >> Best, >> >> Yufei >> >

[VOTE] Release Apache Iceberg 1.3.1 RC1

2023-07-17 Thread Szehon Ho
Hi Everyone, I propose that we release the following RC as the official Apache Iceberg 1.3.1 release. The commit ID is 62c34711c3f22e520db65c51255512f6cfe622c4 * This corresponds to the tag: apache-iceberg-1.3.1-rc1 * https://github.com/apache/iceberg/commits/apache-iceberg-1.3.1-rc1 * https://gi

Re: [DISCUSS] Apache Iceberg Release 1.3.1

2023-07-14 Thread Szehon Ho
it would be great to backport this to 1.3.x as > well. > > Kind regards, > Fokko > > Op wo 12 jul 2023 om 22:09 schreef Szehon Ho : > >> Hi guys >> >> Just an update on this. Another issue came up about the new 1.3.0 >> function rewrite_position_deletes (

Re: [DISCUSS] Apache Iceberg Release 1.3.1

2023-07-12 Thread Szehon Ho
eberg/milestones/Iceberg%201.3.1 Thanks Szehon On Mon, Jul 10, 2023 at 11:14 AM Szehon Ho wrote: > Thanks Eduard! Merged all your backport prs, I will commit the last one > probably tomorrow and then we can start the release. > > Thanks > Szehon > > On Sun, Jul 9, 2023 at 11

Re: [DISCUSS] Apache Iceberg Release 1.3.1

2023-07-10 Thread Szehon Ho
that we can start backporting those bug fixes. > > Eduard > > On Fri, Jul 7, 2023 at 6:52 PM Szehon Ho wrote: > >> Thanks a lot Eduard! I think https://github.com/apache/iceberg/pull/7933 >> is also a good candidate as well. >> >> Thanks, >> Szehon

Re: [DISCUSS] Apache Iceberg Release 1.3.1

2023-07-07 Thread Szehon Ho
Thu, Jul 6, 2023 at 9:02 PM Jean-Baptiste Onofré >> wrote: >> >>> Hi, >>> >>> It sounds good to me to have 1.3.1. >>> >>> Thanks ! >>> Regards >>> JB >>> >>> On Fri, Jul 7, 2023 at 12:53 AM Szehon Ho >

[DISCUSS] Apache Iceberg Release 1.3.1

2023-07-06 Thread Szehon Ho
Hi I wanted to start a discussion for whether it's the right time for 1.3.1, a patch release of 1.3.0. It was started based on the issue found by Xiangyang (@ConeyLiu) : https://github.com/apache/iceberg/pull/7931#pullrequestreview-1507935277. Do people have any other bug fixes that should be inc

Re: allowing configs to be specified in SQLConf for Spark reads/writes

2023-06-26 Thread Szehon Ho
> it. > > Thanks for reviving the effort. > Manu > > Szehon Ho 于2023年6月22日 周四00:45写道: > >> Hi, >> >> Yea, its definitely an issue. >> >> Fwiw, I was looking at reviving the old effort in Spark to pass in >> configs dynamically in Spark SQL s

Re: allowing configs to be specified in SQLConf for Spark reads/writes

2023-06-21 Thread Szehon Ho
Hi, Yea, it's definitely an issue. Fwiw, I was looking at reviving the old effort in Spark to pass in configs dynamically in Spark SQL statement, which is probably the cleanest solution. (https://github.com/apache/spark/pull/34072 was the old effort, and I made https://github.com/apache/spark/pul

Re: Iceberg old partition gc

2023-06-03 Thread Szehon Ho
den the > metadata system. tagging can extend the history with selective snapshots. > > It seems that you are saying that purging actions of old partitions are > creating new snapshots, which are taking up some space in the snapshot > history. But if snapshot expiration is time based (

Re: Iceberg old partition gc

2023-06-02 Thread Szehon Ho
t doing the delete. You can then recover > the snapshot if you happen to have accidentally TTL'd a partition. > > On Fri, Jun 2, 2023 at 8:51 AM Szehon Ho wrote: > >> I think this violates Iceberg’s assumption of immutable snapshots. That >> would require modifying the

Re: Iceberg old partition gc

2023-06-02 Thread Szehon Ho
I think this violates Iceberg’s assumption of immutable snapshots. That would require modifying the old snapshot to no longer point to those gc’ed data files, else not sure how you can time-travel to read from that snapshot, if some of its files are deleted? That being said, I also had this thoug

Re: [DISCUSS] Default format version for new tables?

2023-05-24 Thread Szehon Ho
Hi, I'm +1 to making v2 the default, say after this release. It seems most of the features brought up as concerns on Spark side in the thread Gabor linked have been implemented (like position delete lifecycle). But Anton's point is also good. Even if some delete file features are missing, V2 is

Re: [VOTE] Release Apache Iceberg 1.3.0 RC0

2023-05-24 Thread Szehon Ho
+1 (binding) 1. verify signatures 2. verify checksum 3. verify license documentation 4. build and run tests 5. Ran simple tests on Spark 3.4 - Create simple table and check metadata tables - Ran 'delete from' statement to generate position delete, and run rewrite_position_delete Thanks Szehon On

Re: Welcome new committers and PMC!

2023-05-05 Thread Szehon Ho
Thanks all, really appreciate it, and congrats to Eduard and Amogh ! Szehon On Fri, May 5, 2023 at 12:37 AM Mingliang Liu wrote: > Congrats! All well deserved. > > On Thu, May 4, 2023 at 11:50 PM Eduard Tudenhoefner > wrote: > >> Thanks everyone, and also congrats to Amogh and Szehon! >> >> On

Re: tradeoffs between serializable vs snapshot isolation for single writer

2023-05-04 Thread Szehon Ho
Whoops, I didn’t see Ryan answer already. > On May 4, 2023, at 3:18 PM, Szehon Ho wrote: > > Hi, > > I believe it only matters if you have conflicting commits. For single writer > case, I think you are right and it should not matter, so you may save very > sligh

Re: tradeoffs between serializable vs snapshot isolation for single writer

2023-05-04 Thread Szehon Ho
Hi, I believe it only matters if you have conflicting commits. For single writer case, I think you are right and it should not matter, so you may save very slightly in performance by turning it to Snapshot Isolation. The checks are metadata checks though, so I would think it will not be a sig

Re: [Proposal] Partition stats in Iceberg

2023-05-02 Thread Szehon Ho
g forward to the work in the phase 2 implementation. > Let me know if I can help, thanks. > > On Tue, May 2, 2023 at 4:28 PM Szehon Ho wrote: > >> Yea I agree, I had a handy query for the last update time of partition. >> >> SELECT >> >> e.data_file.partition, &

Re: [Proposal] Partition stats in Iceberg

2023-05-02 Thread Szehon Ho
Yea I agree, I had a handy query for the last update time of partition. SELECT e.data_file.partition, MAX(s.committed_at) AS last_modified_time FROM db.table.snapshots s JOIN db.table.entries e WHERE s.snapshot_id = e.snapshot_id GROUP BY e.data_file.partition It's a bit lengthy currentl
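The shape of that query can be tried without a Spark cluster. The following is a minimal sketch that stands in SQLite for the engine and two toy tables for the snapshots and entries metadata tables; the table layout, the partition_key column, and the sample rows are all assumptions made for illustration, not Iceberg's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Toy stand-ins for the snapshots and entries metadata tables.
    CREATE TABLE snapshots (snapshot_id INTEGER, committed_at TEXT);
    CREATE TABLE entries (snapshot_id INTEGER, partition_key TEXT);
    INSERT INTO snapshots VALUES
        (1, '2023-05-01 10:00:00'),
        (2, '2023-05-02 09:30:00');
    INSERT INTO entries VALUES
        (1, 'day=2023-04-30'),
        (1, 'day=2023-05-01'),
        (2, 'day=2023-05-01');
    """
)

# Latest commit time among the snapshots that added an entry to each partition.
rows = conn.execute(
    """
    SELECT e.partition_key, MAX(s.committed_at) AS last_modified_time
    FROM snapshots s
    JOIN entries e ON s.snapshot_id = e.snapshot_id
    GROUP BY e.partition_key
    ORDER BY e.partition_key
    """
).fetchall()

for partition_key, last_modified_time in rows:
    print(partition_key, last_modified_time)
```

Each partition is reported with the commit time of the newest snapshot that touched it, which is the "last modified" value the query in the message is after.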

Re: Welcome new PMC members!

2023-04-12 Thread Szehon Ho
Nice, congratulations guys! Szehon On Wed, Apr 12, 2023 at 12:35 AM Gidon Gershinsky wrote: > Congrats Fokko, Steven, Yufei! > > Cheers, Gidon > > > On Wed, Apr 12, 2023 at 7:14 AM Ajantha Bhat > wrote: > >> Congratulations to all. >> >> On Wed, Apr 12, 2023 at 8:51 AM OpenInx wrote: >> >>> Co

Re: [VOTE] Release Apache Iceberg 1.2.1 RC2

2023-04-06 Thread Szehon Ho
+1 (non-binding) Verified signature Verified checksum Verified License Built and ran tests Ran simple queries on spark 3.3. Thanks Dan for the release, Szehon On Thu, Apr 6, 2023 at 12:04 PM Daniel Weeks wrote: > Hi Everyone, > > I propose that we release the following RC as the official Apach

Re: [Discuss] Allow all users who have Committed to the project to run CI without Approval

2023-03-29 Thread Szehon Ho
+1 Thanks Szehon On Wed, Mar 29, 2023 at 10:27 AM Eduard Tudenhoefner wrote: > +1 for "Only requires approval first time" > > On Wed, Mar 29, 2023 at 6:32 PM John Zhuge wrote: > >> +1 for "Only requires approval first time" >> >> On Wed, Mar 29, 2023 at 9:03 AM Ajantha Bhat >> wrote: >> >>> >

Re: [VOTE] Release Apache Iceberg 1.2.0 RC1

2023-03-15 Thread Szehon Ho
Hi, One note, on this release, I ran some simple spark-SQL using a local Spark, like "insert into table select 1". I find any of these operations now spawn 200 executors and take a while to finish. |== Physical Plan ==\nAppendData org.apache.spark.sql.execution.datasources.v2.DataSourceV2Strat

Re: In Remembrance of Kyle

2022-12-06 Thread Szehon Ho
Very shocked when I first heard this over the weekend. Became more sad when I learned how long he was sick for, and so humbled that he chose to spend so much of his last days with us in the Iceberg community. I did not have a chance to work directly with him in Apple as I was on a different team.

Re: [VOTE] Release Apache Iceberg 1.1.0 RC2

2022-11-17 Thread Szehon Ho
+1 (non-binding) 1. Verify signature 2. Verify checksum 3. License RAT check 4. Run unit test, Actually got a failure: org.apache.iceberg.spark.extensions.TestCopyOnWriteDelete > testDeleteWithSnapshotIsolation[catalogName = spark_catalog, implementation = org.apache.iceberg.spark.SparkSessionCa

RemoveDanglingDeleteFile proposal

2022-11-04 Thread Szehon Ho
Hi all, I made a proposal about adding a Spark Procedure RemoveDanglingDeleteFiles. It would do a more comprehensive job to remove Delete Files that stay around after they become invalid (stop applying to Data Files), which happens in some cases, taking up storage and potentially affecting per

Re: [DISCUSS] October board report

2022-10-12 Thread Szehon Ho
Turoczy, Bill Zhang) - Apache Iceberg's REST Catalog - A Gateway to Enriching Data Access via the Simplicity of an HTTP Service (Sam Redai) - Iceberg's Best Secret: Exploring Metadata Tables (Szehon Ho) - Integrated Audits: Streamlined Data Observability with Apache Iceberg (Sam

Re: [VOTE] Release Apache Iceberg 1.0.0 RC0

2022-10-10 Thread Szehon Ho
:26 AM Szehon Ho wrote: > Hi, > > I get a NoClassDefFoundError from IcebergSparkExtensions when running > Spark 3.3, with iceberg-spark-runtime-3.3_2.12-1.0.0.jar. I noticed this > jar doesn't contain scala classes, unlike previous jars > iceberg-spark-runtime-3.3_2.1

Re: [VOTE] Release Apache Iceberg 1.0.0 RC0

2022-10-10 Thread Szehon Ho
Hi, I get a NoClassDefFoundError from IcebergSparkExtensions when running Spark 3.3, with iceberg-spark-runtime-3.3_2.12-1.0.0.jar. I noticed this jar doesn't contain scala classes, unlike previous jars iceberg-spark-runtime-3.3_2.12-0.14.1.jar. scala> spark.sql("show databases").show java.lang.

Re: Welcome Yufei Gu as a committer

2022-08-25 Thread Szehon Ho
Congratulations, Yufei! Thanks Szehon > On Aug 25, 2022, at 4:20 PM, Anton Okolnychyi > wrote: > > I’d like to welcome Yufei Gu as a committer to the project. > > Thanks for all your hard work, Yufei! > > - Anton

Re: Welcome Fokko Driesprong as a committer!

2022-08-22 Thread Szehon Ho
Congratulations! Szehon On Mon, Aug 22, 2022 at 12:25 PM Péter Váry wrote: > Congratulations Fokko! > > On Mon, Aug 22, 2022, 16:37 Jahagirdar, Amogh > wrote: > >> Congratulations Fokko! >> >> >> >> *From: *Gabor Kaszab >> *Reply-To: *"dev@iceberg.apache.org" >> *Date: *Monday, August 22, 202

Re: [DISCUSS] Automatic Code Formatting / Code Style / Enforcing Code Style

2022-07-29 Thread Szehon Ho
Thanks for the auto formatting initiative, I think it's really a time saver. I also agree about the line length, it would be better to keep it at 120 and a bummer it has to be reduced to 100 now. Looking at palantir-format, I actually like some of their format choices like line-length and also not

Re: [VOTE] Release Apache Iceberg 0.14.0 RC1

2022-07-15 Thread Szehon Ho
+1 (non-binding) - Verified signature - Verified checksum - Rat check - Could not find Apache license headers on iceberg-build.properties ( as mentioned by Ryan) - Ran tests - Same error mentioned by John: org.apache.iceberg.aws.s3.TestS3FileIO > testPrefixDel

Re: [VOTE] Adopt Puffin format as a file format for statistics and indexes

2022-06-09 Thread Szehon Ho
+1, it's an exciting step for Iceberg, look forward to all the new statistics and secondary indices it will allow. Had a few questions of what the reference to Puffin file(s) will be in the Iceberg spec, but it's orthogonal to Puffin file format itself. Thanks, Szehon On Thu, Jun 9, 2022 at 3:32

Re: [VOTE] Release Apache Iceberg 0.13.2 RC1

2022-06-06 Thread Szehon Ho
+1 (non-binding) 1. Verified signatures 2. Verified checksums 3. RAT checks 4. Build and test 5. Tested with Spark 3.2, create a table and run a few queries Thanks Szehon On Mon, Jun 6, 2022 at 10:46 AM Daniel Weeks wrote: > +1 (binding) > > verified sigs/sums/license/build/tes

Re: [VOTE] Release Apache Iceberg 0.13.2 RC0

2022-05-29 Thread Szehon Ho
On the other topic, the pr for 0.13 branch is merged: https://github.com/apache/iceberg/pull/4890, my preference will be to include this in new RC to solve the aforementioned issue : https://github.com/apache/iceberg/issues/4718. Thanks, Szehon On Sun, May 29, 2022 at 2:59 PM Szehon Ho wrote

Re: [VOTE] Release Apache Iceberg 0.13.2 RC0

2022-05-29 Thread Szehon Ho
eEE15k0XH39/ZCYPikR8XEqs0YkO > wdFeyrBN22jtT48jMJ4IFw4odabqOqBn6Wazx3tBg0ZMTxn/i2H4tHpe78RIj/7Z > 7eLhkMY0meA64TMBCc0aS3ffCnJzetWOSpgjv9o= > =gy3b > -END PGP PUBLIC KEY BLOCK- > > > > On May 28, 2022, at 2:04 PM, Szehon Ho wrote: > > Hi > > For gpg verify

Re: [VOTE] Release Apache Iceberg 0.13.2 RC0

2022-05-28 Thread Szehon Ho
Hi For gpg verify KEYS I get: gpg: Can't check signature: No public key I imported latest keys and do see key for : uid Russell Spitzer (CODE SIGNING KEY) sub rsa4096 2022-05-26 [E] but maybe no public key? Maybe I am missing something obvious. Also wanted to ask, can we get this

Re: Welcome Szehon Ho as a committer!

2022-03-11 Thread Szehon Ho
ufei Gu >>>> <mailto:flyrain...@gmail.com>> wrote: > >>>>> > >>>>> Congratulations Szehon! > >>>>> Best, > >>>>> > >>>>> Yufei > >>>>>

Re: Getting last modified timestamp/other stats per partition

2022-03-07 Thread Szehon Ho
>> >> *From:* Mayur Srivastava >> *Sent:* Thursday, February 24, 2022 7:27 AM >> *To:* dev@iceberg.apache.org >> *Subject:* RE: Getting last modified timestamp/other stats per partition >> >> >> >> Thanks Szehon. I’ll give this a try. >> >

Re: Getting last modified timestamp/other stats per partition

2022-02-23 Thread Szehon Ho
Hi Probably the metadata tables can help with this. For the size/num_rows of partitions, you can query the files table, https://iceberg.apache.org/docs/latest/spark-queries/#files. (Because Iceberg keeps stats for files, and not necessary partitions). SELECT partition, sum(file_size_in_bytes),
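The per-partition totals described above amount to a group-by over rows of the files metadata table. The sketch below shows that aggregation in plain Python over hand-written rows; the sample values and the helper name are hypothetical, and only the column names partition, file_size_in_bytes, and record_count are meant to mirror the files table.

```python
from collections import defaultdict


def partition_totals(file_rows):
    """Sum file sizes and record counts per partition value."""
    totals = defaultdict(lambda: {"bytes": 0, "rows": 0, "files": 0})
    for row in file_rows:
        agg = totals[row["partition"]]
        agg["bytes"] += row["file_size_in_bytes"]
        agg["rows"] += row["record_count"]
        agg["files"] += 1
    return dict(totals)


# Hypothetical rows, shaped like a result set from the files metadata table.
files = [
    {"partition": "day=2022-02-22", "file_size_in_bytes": 1024, "record_count": 10},
    {"partition": "day=2022-02-22", "file_size_in_bytes": 2048, "record_count": 25},
    {"partition": "day=2022-02-23", "file_size_in_bytes": 512, "record_count": 4},
]

totals = partition_totals(files)
for partition, agg in sorted(totals.items()):
    print(partition, agg)
```

Because the statistics are kept per file, any per-partition figure has to be derived this way, whether in SQL or in client code.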

Re: [VOTE] Release Apache Iceberg 0.13.0 RC2

2022-01-30 Thread Szehon Ho
+1 (non-binding) Verified signature Verified checksum Rat check Built and ran test, all succeed, after some temporary local HMS timeout Tested relevant jar with Spark 3.2, created various tables and ran queries Thanks Szehon On Fri, Jan 28, 2022 at 12:19 PM Russell Spitzer wrote: > +1 > All te

Re: Number of entries in manifest-list

2022-01-07 Thread Szehon Ho
of > `manifest_file` structs. > > Is there a general order-of-magnitude target number of `manifest_file` > structs? Presumably that would dictate when one would want to merge > manifest files and/or data files. > > Thanks again! > ggg > > > On Fri, Jan 7, 2022 at 1

Re: Number of entries in manifest-list

2022-01-07 Thread Szehon Ho
Hi, The manifest entries are one per data file or delete file, so depends how many data files/delete files your table has. Number of files is controlled mostly by the parallelism of the job that writes the table, though there are Iceberg RewriteDataFile utilities that can compact as well (as in y

Re: Welcome new PMC members!

2021-11-18 Thread Szehon Ho
Awesome, congratulations Jack and Russell! > On 18 Nov 2021, at 09:30, Ryan Murray wrote: > > Congratulations both! Well deserved! > > On Thu, 18 Nov 2021, 09:19 Omar Al-Safi, > wrote: > Congrats both of you! > > On Thu, Nov 18, 2021 at 8:31 AM Eduard Tudenhoefner

Re: [DISCUSS] Iceberg roadmap

2021-09-10 Thread Szehon Ho
Hi I also missed the last sync, and wanted to add two things if possible. Thanks, Szehon Priority 2: - Core: Predicate pushdown for remaining Metadata tables [medium] - Core/Spark: Support serializable isolation for ReplacePartitions / Insert Overwrite [medium] On Fri, Sep 10, 2021 a

Re: Iceberg python library sync

2021-08-12 Thread Szehon Ho
+1, would love to listen in as well Thanks, Szehon > On 12 Aug 2021, at 12:48, Arthur Wiedmer > wrote: > > Hi Jun, > > Please add me as well! > > Best, > Arthur > > > > On Thu, Aug 12, 2021 at 12:19 AM Jun H. > wrote: > Hi everyone, > > Since early this year,
