[DISCUSS] Apache Iceberg 1.7.0 Release Cutoff

2024-10-03 Thread Russell Spitzer
Hi y'all!

As discussed at the community sync on Wednesday, October has begun and we
are beginning to flesh out the 1.7.0 release as well as the V3 Table Spec.
Since we are a little worried that we won't have all of the Spec items we
want by the end of October, we discussed that we may want to just do a
release with what we have at the end of the month.

It was noted that we have a lot of exciting things already in and we may be
able to get just the beginning of V3 support as well.

To that end it was proposed that we do the next Iceberg release at the end
of the month (Oct 31st), and have the cutoff a week before (Oct 25th).
Does anyone have objections or statements of support for this plan?

With this in mind, please also start marking any remaining PRs or Issues
that you have with 1.7.0 so we can prioritize them for the cutoff date.


Thanks everyone for your time,
Russ


Re: [Discuss] Geospatial Support

2024-09-30 Thread Russell Spitzer
All my concerns are addressed, I'm ready to vote.

On Mon, Sep 30, 2024 at 1:21 PM Szehon Ho  wrote:

> Hi all,
>
> There have been several rounds of discussion on the PR:
> https://github.com/apache/iceberg/pull/10981 and I think most of the main
> points have been addressed.
>
> If anyone is interested, please take a look.  If there are no other major
> points, we plan to start a VOTE thread soon.
>
> I know Jia and team are also volunteering to work on the prototype
> immediately afterwards.
>
> Thank you,
> Szehon
>
> On Tue, Aug 20, 2024 at 1:57 PM Szehon Ho  wrote:
>
>> Hi all
>>
>> Please take a look at the proposed spec change to support Geo type for V3
>> in : https://github.com/apache/iceberg/pull/10981, and comment or
>> otherwise let me know your thoughts.
>>
>> Just as an FYI it incorporated the feedback from our last meeting (with
>> Snowflake and Wherobots engineers).
>>
>> Thanks,
>> Szehon
>>
>> On Wed, Jun 26, 2024 at 7:29 PM Szehon Ho 
>> wrote:
>>
>>> Hi
>>>
>>> It was great to meet in person with Snowflake engineers and we had a
>>> good discussion on the paths forward.
>>>
>>> Meeting notes for the Snowflake-Iceberg sync.
>>>
>>>    - Iceberg proposed Geometry type defaults to (edges=planar,
>>>    crs=CRS84).
>>>    - Snowflake has two types, Geography (spherical) and Geometry
>>>    (planar, with customizable CRS).  The data layout/encoding is the
>>>    same for both types.  Let's see how we can support each in an
>>>    Iceberg type, especially wrt Iceberg partition/file pruning.
>>>    - Geography type support
>>>       - Main concern is the need for a suitable partition transform
>>>       for partition-level filtering; the candidate is Michael Entin's
>>>       proposal.
>>>       - Secondary concern is file- and row-group-level filtering.
>>>       Gang's Parquet proposal allows storage of S2 / H3 IDs in Parquet
>>>       stats, so we can also leverage that in Iceberg pruning code
>>>       (Google and Uber libraries are compatible).
>>>    - Geometry type support
>>>       - Main concern is that the partition transform needs to
>>>       understand the CRS, but this can be solved by having the XZ2
>>>       transform created with a customizable min/max lat/long range
>>>       (that's all it needs).
>>>    - Should (crs, edges) be stored as properties on the Geography type
>>>    in Phase 1?
>>>       - Should be fine to store, with only defaults allowed in Phase 1.
>>>       - Concern 1: if edges is stored, there will be asks to store
>>>       other properties like (orientation, epoch).  Solution is to punt
>>>       these follow-on properties for later.
>>>       - Concern 2: if crs is stored, what format?  PROJJSON vs SRID.
>>>       Solution is to leave it as a string.
>>>       - Concern 3: if crs is stored as a string, Iceberg cannot read
>>>       it.  This should be OK, as we only need it for the XZ2 transform,
>>>       where the user already passes in the info from the CRS (up to the
>>>       user to make sure these align).
>>>
>>> Thanks
>>> Szehon
>>>
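
As a side note on the XZ2 point in the notes above, a sketch of why a
min/max lat/long range is "all it needs" (illustrative only, not a real
transform implementation): the transform only uses the CRS bounds to
normalize coordinates into the unit square before computing its
space-filling index.

// illustrative only: an XZ2-style transform needs just the coordinate
// bounds of the CRS to normalize points into [0, 1]^2 before indexing
record Bounds(double minX, double minY, double maxX, double maxY) {
  double normX(double x) { return (x - minX) / (maxX - minX); }
  double normY(double y) { return (y - minY) / (maxY - minY); }
}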
>>> On Tue, Jun 18, 2024 at 12:23 PM Szehon Ho 
>>> wrote:
>>>
 Jia and I will sync with the Snowflake folks to see if we can have a
 solution, or roadmap to solution, in the proposal.

 Thanks JB for the interest!  By the way, I want to schedule a meeting to
 go over the proposal. It seems there's good feedback from folks on the geo
 side (and even the Parquet community), but not too many eyes/feedback from
 other folks/PMC in the Iceberg community.  This might be due to lack of
 familiarity/time to read through it all.  In fact, a lot of the advanced
 discussions like this one are for Phase 2 items, and Phase 1 items are
 relatively straightforward, so I wanted to explain that.  As I know it's
 summer vacation for some folks, we can do this in a week or in early July;
 hope that sounds good to everyone.

 Thanks,
 Szehon

 On Tue, Jun 18, 2024 at 1:54 AM Jean-Baptiste Onofré 
 wrote:

> Hi Jia
>
> Thanks for the update. I'm gonna re-read the whole thread and document
> to have a better understanding.
>
> Thanks !
> Regards
> JB
>
> On Mon, Jun 17, 2024 at 7:44 PM Jia Yu  wrote:
>
>> Hi Snowflake folks,
>>
>> Please let me know if you have other questions regarding the
>> proposal. If any, Szehon and I can set up a zoom call with you guys to
>> clarify some details. We are in the Pacific time zone. If you are in
>> Europe, maybe early morning Pacific Time works best for you?
>>
>> Thanks,
>> Jia
>>
>> On Wed, Jun 5, 2024 at 6:28 PM Gang Wu  wrote:
>>
>>> > The min/max stats are discussed in the doc (Phase 2), depending on
>>> the non-trivial encoding.
>>>
>>> Just want to add that min/max stats filtering could be s

Re: [VOTE] Table v3 spec: Add unknown and new type promotion

2024-09-27 Thread Russell Spitzer
+1 (binding)
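
For intuition on the date promotions listed below: the spec represents a
date as days from the Unix epoch, and timestamps as micro- or nanoseconds
from the epoch, so a promoted date value maps cleanly to midnight of that
day and ordering is preserved. A minimal sketch (values illustrative):

class DatePromotion {
  public static void main(String[] args) {
    // date is stored as days from 1970-01-01; timestamp as microseconds
    // and timestamp_ns as nanoseconds from 1970-01-01 00:00:00
    long dateDays = 20_000;                              // date value
    long tsMicros = dateDays * 86_400L * 1_000_000L;     // timestamp equivalent
    long tsNanos  = dateDays * 86_400L * 1_000_000_000L; // timestamp_ns equivalent
    System.out.println(tsMicros + " " + tsNanos);
  }
}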

On Fri, Sep 27, 2024 at 4:37 PM rdb...@gmail.com  wrote:

> Hi everyone,
>
> I'd like to vote on PR #10955, which has been open for a while, with the
> changes to add new type promotion cases. After discussion, the PR has been
> scoped down to keep complexity low. It now adds:
>
> * An `unknown` type for cases when only `null` values have been observed
> * Type promotion from `unknown` to any other type
> * Type promotion from `date` to `timestamp` or `timestamp_ns`
> * Clarification that promotion is not allowed if it breaks transform
> results
>
> The set of changes is quite a bit smaller than originally proposed because
> of the issue already discussed about lower and upper bounds values, and it
> no longer includes variant. I think that we can add more type promotion
> cases after we improve bounds metadata. This adds what we can now to keep
> v3 moving forward.
>
> Please vote in the next 72 hours:
>
> [ ] +1, commit the proposed spec changes
> [ ] -0
> [ ] -1, do not make these changes because . . .
>
> Thanks,
>
> Ryan
>


Re: [DISCUSS] Iceberg Summit 2025 ?

2024-09-27 Thread Russell Spitzer
I am really excited about the prospect of another Summit and also had a
great time last year. I think we had a great selection of talks and I'm
hoping we can do so again.

I'm very much in support of having an in-person element; I would love to
have a chance to talk face to face with other members of the community. I
do think we should preserve online viewing as well since I know not
everyone has the ability to travel.

I do hope that we can have more talks about users running Iceberg in
production as well. I think we did a really good job of covering Iceberg
development last time but didn't have as many practitioner discussions as
I would have liked. I also think it would be great if we had a section
that was purely just "ideas for Iceberg" where folks can pitch their
features and proposals to a much broader audience.

I also would love to have some workshops this time as well, showing folks
how to use the project, how to make their first tables, and how to
contribute to the Iceberg project.

Things I'd like to avoid: sales pitches, and talks not focused on Iceberg
or its ecosystem (personally I don't really want to hear anything about AI
or LLMs, but I know that might not be everyone's view). Ideally I would
like this to be a vendor-neutral event where planning is as transparent as
possible for the community.

I'd love to hear what other folks are thinking,
Russ

On Fri, Sep 27, 2024 at 12:51 PM Jean-Baptiste Onofré 
wrote:

> Hi folks,
>
> Last year in June we started to discuss the first edition of the Iceberg
> Summit (https://lists.apache.org/thread/cbgx1jlc9ywn618yod2487g498lgrkt3).
>
>
> The Iceberg Summit was in May 2024, and it was clearly a great community
> event, with a lot of nice talks.
> This first edition was fully virtual.
>
> I think it would be great to have an Iceberg Summit 2025 community event,
> but maybe this time as a hybrid event.
> Also, given the number of talks received by the selection committee for
> Iceberg Summit 2024, I would suggest (for the future Selection Committee)
> adding new talk tracks (like user stories, practitioners, ...).
>
> The process would be similar to that of Iceberg Summit 2024:
> - first, the community discusses the idea here, and the kind of event
> (virtual, in person, hybrid), ...
>* should we have another event?
>* would you like there to be an in-person event?
>* what kind of talks would you like to hear at such an event?
>* what kind of talks would you not like to hear at such an event?
> - if there are no objections, the Iceberg PMC should approve the use of
> the Iceberg name and the ASF VP M&P should be notified. I can help on the
> paperwork and process again.
> - the PMC will appoint two committees (at least selection and sponsoring
> committees)
>
> Thoughts?
>
> Regards
> JB
>


Re: Clarification on DayTransform Result Type

2024-09-27 Thread Russell Spitzer
Good thing DateType is an Integer :)
https://github.com/apache/iceberg/blob/113c6e7d62e53d3e3cb15b1712f3a1db473ca940/api/src/main/java/org/apache/iceberg/types/Type.java#L37
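
A quick way to see this from code (illustrative; relies on the Java
library's type metadata): DateType's underlying Java representation is an
Integer of days from the Unix epoch, so the day transform's DateType result
is still int-valued on the wire even though it differs from the spec's
declared `int` result type.

import org.apache.iceberg.types.Type;
import org.apache.iceberg.types.Types;

public class DayResultType {
  public static void main(String[] args) {
    Type result = Types.DateType.get();
    // prints "class java.lang.Integer": a date is an int count of days
    System.out.println(result.typeId().javaClass());
  }
}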

On Thu, Sep 26, 2024 at 8:38 PM Kevin Liu  wrote:

> Hey folks,
>
> While reviewing a PR to fix DayTransform in PyIceberg (#1208), we found
> an inconsistency between the spec and the Java Iceberg library.
>
> According to the spec, the result type for the "day partition transform"
> should be `int`, similar to other time-based partition transforms
> (year/month/hour). However, in the Java Iceberg library, the result type
> for the day partition transform is `DateType` (source). This seems to be
> a discrepancy from the spec, as the day partition transform is the only
> time-based transform with a non-int result type, whereas the others use
> IntegerType (source).
>
> Could someone confirm if my understanding is correct? If so, is there any
> historical context for this difference? Lastly, how should we approach
> resolving this moving forward?
>
> Best,
> Kevin
>
>


Re: [DISCUSS] Spark 3.5.3 breaks Iceberg SparkSessionCatalog

2024-09-25 Thread Russell Spitzer
I checked, and extending DelegatingCatalogExtension will be quite difficult,
or at least cause several breaks in current Iceberg SparkSessionCatalog
implementations. Note this has nothing to do with third-party catalogs but
more directly with how Iceberg works with Spark regardless of catalog
implementation.

Main issues on the Iceberg side:

1. Initialize is final and empty in DelegatingCatalogExtension. This means
we have no way of taking custom catalog configuration and applying it to
the Iceberg plugin. Currently this is used for a few things: choosing the
underlying Iceberg catalog implementation, catalog cache settings, and
Iceberg environment context.

2. No access to the delegate catalog object. The delegate is private, so we
are unable to touch it in our extended class; it is currently used for
Iceberg's "staged create" and "staged replace" functions. We could work
around this by disabling staged create and replace when the delegate is
being used, but that would break Iceberg behavior.

Outside of these aspects I was able to get everything else working as
expected but I think both of these are probably blockers.
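
To make those two blockers concrete, here is roughly what the subclass runs
into (a sketch based on the shape of Spark's DelegatingCatalogExtension;
the subclass itself is hypothetical):

import org.apache.spark.sql.connector.catalog.DelegatingCatalogExtension;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

// hypothetical subclass, for illustration only
public class IcebergSessionCatalogSketch extends DelegatingCatalogExtension {

  // 1. would not compile: initialize(..) is final (and a no-op) in the
  //    parent, so options like the underlying Iceberg catalog impl or
  //    cache settings can never reach this class
  // @Override
  // public void initialize(String name, CaseInsensitiveStringMap options) {}

  // 2. the delegate session catalog is a private field in the parent with
  //    no accessor, so staged create/replace cannot fall back to it here
}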


On Wed, Sep 25, 2024 at 3:51 PM Russell Spitzer 
wrote:

> I think it should be minimally difficult to switch this around on the
> Iceberg side; we only have to move the initialize code out and duplicate
> it. Not a huge cost.
>
> On Sun, Sep 22, 2024 at 11:39 PM Wenchen Fan  wrote:
>
>> It's a buggy behavior that a custom v2 catalog (without extending
>> DelegatingCatalogExtension) expects Spark to still use the v1 DDL commands
>> to operate on the tables inside it. This is also why the third-party
>> catalogs (e.g. Unity Catalog and Apache Polaris) can not be used to
>> overwrite `spark_catalog` if people still want to use the Spark built-in
>> file sources.
>>
>> Technically, I think it's wrong for a third-party catalog to rely on
>> Spark's session catalog without extending `DelegatingCatalogExtension`, as
>> it confuses Spark. If it has its own metastore, then it shouldn't delegate
>> requests to the Spark session catalog and use v1 DDL commands which only
>> work with the Spark session catalog. Otherwise, it should extend
>> `DelegatingCatalogExtension` to indicate it.
>>
>> On Mon, Sep 23, 2024 at 11:19 AM Manu Zhang 
>> wrote:
>>
>>> Hi Iceberg and Spark community,
>>>
>>> I'd like to bring your attention to a recent change[1] in Spark 3.5.3
>>> that effectively breaks Iceberg's SparkSessionCatalog[2] and blocks Iceberg
>>> upgrading to Spark 3.5.3[3].
>>>
>>> SparkSessionCatalog, as a customized Spark V2 session catalog,
>>> supports creating a V1 table with a V1 command. That's no longer allowed
>>> after the change unless it extends DelegatingCatalogExtension. It is not
>>> minor work since SparkSessionCatalog already extends a base class[4].
>>>
>>> To resolve this issue, we have to make changes to public interfaces at
>>> either Spark or Iceberg side. IMHO, it doesn't make sense for a downstream
>>> project to refactor its interfaces when bumping up a maintenance version of
>>> Spark. WDYT?
>>>
>>>
>>> 1. https://github.com/apache/spark/pull/47724
>>> 2.
>>> https://iceberg.apache.org/docs/nightly/spark-configuration/#replacing-the-session-catalog
>>> 3. https://github.com/apache/iceberg/pull/11160
>>> 4.
>>> https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSessionCatalog.java
>>>
>>> Thanks,
>>> Manu
>>>
>>>


Re: [DISCUSS] Spark 3.5.3 breaks Iceberg SparkSessionCatalog

2024-09-25 Thread Russell Spitzer
I think it should be minimally difficult to switch this around on the
Iceberg side; we only have to move the initialize code out and duplicate
it. Not a huge cost.

On Sun, Sep 22, 2024 at 11:39 PM Wenchen Fan  wrote:

> It's a buggy behavior that a custom v2 catalog (without extending
> DelegatingCatalogExtension) expects Spark to still use the v1 DDL commands
> to operate on the tables inside it. This is also why the third-party
> catalogs (e.g. Unity Catalog and Apache Polaris) can not be used to
> overwrite `spark_catalog` if people still want to use the Spark built-in
> file sources.
>
> Technically, I think it's wrong for a third-party catalog to rely on
> Spark's session catalog without extending `DelegatingCatalogExtension`, as
> it confuses Spark. If it has its own metastore, then it shouldn't delegate
> requests to the Spark session catalog and use v1 DDL commands which only
> work with the Spark session catalog. Otherwise, it should extend
> `DelegatingCatalogExtension` to indicate it.
>
> On Mon, Sep 23, 2024 at 11:19 AM Manu Zhang 
> wrote:
>
>> Hi Iceberg and Spark community,
>>
>> I'd like to bring your attention to a recent change[1] in Spark 3.5.3
>> that effectively breaks Iceberg's SparkSessionCatalog[2] and blocks Iceberg
>> upgrading to Spark 3.5.3[3].
>>
>> SparkSessionCatalog, as a customized Spark V2 session catalog,
>> supports creating a V1 table with a V1 command. That's no longer allowed
>> after the change unless it extends DelegatingCatalogExtension. It is not
>> minor work since SparkSessionCatalog already extends a base class[4].
>>
>> To resolve this issue, we have to make changes to public interfaces at
>> either Spark or Iceberg side. IMHO, it doesn't make sense for a downstream
>> project to refactor its interfaces when bumping up a maintenance version of
>> Spark. WDYT?
>>
>>
>> 1. https://github.com/apache/spark/pull/47724
>> 2.
>> https://iceberg.apache.org/docs/nightly/spark-configuration/#replacing-the-session-catalog
>> 3. https://github.com/apache/iceberg/pull/11160
>> 
>> 4.
>> https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/SparkSessionCatalog.java
>>
>> Thanks,
>> Manu
>>
>>


V3 Spec Changes

2024-09-24 Thread Russell Spitzer
Hi y’all!

I’m excited to say that we have a lot of great Iceberg V3 Spec PRs out
right now. V3 looks like it’s going to be awesome!

A reminder if you haven’t had a chance yet to check them out:

Row Lineage
Materialized Views
Geometric Types
Type Promotion
Variant Type

I’m hoping we can get consensus on all of these ASAP so we can start
working more on implementations and getting closer to the Iceberg 1.7
release (not that these are blockers; I’m just hoping we can have some
functionality ready). So if you are interested in any of these, please
check them out and add comments if you have any, or just leave a note
saying you are on board.

I also know we should have one additional PR for the new Delete File spec
changes coming soon!

Thanks everyone for your time,
Russ


Re: [DISCUSS] Column to Column filtering

2024-09-18 Thread Russell Spitzer
I have similar concerns to Ryan, although I could see that if we were
writing smaller and better-correlated files this could be a big help.
Specifically with variant use cases this may be very useful. I would love
to hear more about the use cases and rationale for adding this. Do you have
any specific examples you can go into detail on?
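
For reference, the file-pruning math Ryan walks through below reduces to a
single check against per-file column stats (a sketch with illustrative
names, not Iceberg's evaluator API):

class ColumnToColumnPruning {
  // For the predicate "x < y": a file can be skipped only when no row in
  // it can possibly match, i.e. when the smallest x is already >= the
  // largest y.
  static boolean mayMatch(long minX, long maxY) {
    return minX < maxY; // keep the file; prune it when minX >= maxY
  }
}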

On Wed, Sep 18, 2024 at 4:48 PM rdb...@gmail.com  wrote:

> I'm curious to learn more about this feature. Is there a driving use case
> that you're implementing it for? Are there common situations in which these
> filters are helpful and selective?
>
> My initial impression is that this kind of expression would have limited
> utility at the table format level. Iceberg tracks column ranges for data
> files and the primary use case for filtering is to skip data files at the
> scan planning phase. For a column-to-column comparison, you would only be
> able to eliminate data files that have non-overlapping ranges. That is, if
> you're looking for rows where x < y, you can only eliminate a file when
> min(x) >= max(y). To me, it seems unlikely that this would be generic enough
> to be worth it, but if there are use cases where this can happen and speed
> up queries I think it may make sense.
>
> Ryan
>
> On Tue, Sep 17, 2024 at 6:21 AM Baldwin, Jennifer
>  wrote:
>
>> I’m starting a thread to discuss a feature for comparisons using column
>> references on the left and right side of an expression wherever iceberg
>> supports column reference to literal value(s) comparisons.  The use case we
>> want to support is filtering of date columns from a single table.  For
>> instance:
>>
>>
>>
>> select * from travel_table
>>
>> where expected_date > travel_date;
>>
>>
>>
>> select * from travel_table
>>
>> where payment_date <> due_date;
>>
>>
>>
>>
>>
>> The changes will impact row and scan file filtering.  Impacted jars are
>> iceberg-api, iceberg-core, iceberg-orc and iceberg-parquet.
>>
>>
>>
>> Is this a feature the Iceberg community would be willing to accept?
>>
>>
>>
>> Here is a link to a Draft PR with current changes, Thanks.
>>
>> https://github.com/apache/iceberg/pull/11152
>>
>>
>>
>


Re: [Discuss] test logging is broken and Avro 1.12.0 upgraded slf4j-api dep to 2.x

2024-09-16 Thread Russell Spitzer
Sounds reasonable to me to just go to 2.x
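
For clarity on what "not use the new API" in Ryan's suggestion below means
in code, a small sketch (class name illustrative): sticking to the classic
org.slf4j API compiles against either the 1.7.x or 2.0.x API jar, while
the 2.x-only fluent API would force downstream projects onto 2.x.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingExample {
  private static final Logger LOG = LoggerFactory.getLogger(LoggingExample.class);

  public static void main(String[] args) {
    // classic API: present in both slf4j-api 1.7.x and 2.0.x
    LOG.info("commit {} to table {}", 42, "db.tbl");

    // fluent API (2.0.x only), avoided so consumers can stay on 1.7.x:
    // LOG.atInfo().setMessage("commit").addArgument(42).log();
  }
}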

On Mon, Sep 16, 2024 at 1:10 PM rdb...@gmail.com  wrote:

> If I understand the SLF4J announcement correctly, it sounds like the best
> option is to rely on binary compatibility between the 1.x and 2.x clients.
>
> As long as we don't use the newer API, the compiled code can use
> either a 1.7.x or 2.0.x API Jar. The API Jar needs to match the provider
> version, so it should be supplied by downstream code instead of Iceberg.
> Luckily, it doesn't look like we ship the SLF4J API in our runtime
> binaries, so we already have a situation where downstream projects can
> choose the SLF4J version of both the API and provider Jars.
>
> In that case, I think the best path forward is to upgrade to 2.x but not
> use the new API features that will cause problems if downstream libraries
> are not already on 2.x.
>
> Does that sound reasonable?
>
> On Wed, Sep 11, 2024 at 11:17 AM Steven Wu  wrote:
>
>> Following up on the discussion from the community sync meeting. Right
>> now, Iceberg test code is in the 3rd row in the table pasted below. With
>> the recent Avro 1.12 upgrade (and slf4j 2.x), the main code is also
>> affected. That means downstream applications (Spark, Trino, Flink, ...) may
>> run into the same situation when upgrading to the next Iceberg 1.7 release.
>>
>> [image: image.png]
>>
>> From the slf4j doc, it seems that *slf4j 2.x API is backward compatible
>> as long as provider/binding is updated to 2.x too*.
>> https://www.slf4j.org/faq.html#changesInVersion200
>>
>> We have two options
>> 1. Exclude the slf4j 2.x transitive dependency from Avro 1.12, but we
>> would need to be comprehensive about forcing the slf4j dependency version
>> resolution to 1.x everywhere until Iceberg is ready to move forward to
>> slf4j 2.x.
>> 2. Move forward with slf4j 2.x now with a clear callout in the 1.7
>> release notes.
>>
>> The first option is the more conservative approach and defers the slf4j
>> 2.x upgrade decision to the downstream applications. But it will add a
>> little burden to make sure future dependency upgrades don't bring in
>> slf4j 2.x transitively. Maybe we can leverage forced resolution in Gradle?
>>
>> configurations.all {
>>   resolutionStrategy {
>>     // pin slf4j-api to 1.x even if a dependency pulls in 2.x transitively
>>     force 'org.slf4j:slf4j-api:1.7.35'
>>   }
>> }
>>
>>
>> Thanks,
>> Steven
>>
>> SLF4J 2.0 stable release was announced *2 years ago*:
>> https://mailman.qos.ch/pipermail/announce/2022/000176.html. It explained
>> binary compatibility.
>>
>> "Mixing different versions of slf4j-api.jar and SLF4J provider can cause
>> problems. For example, if you are usingslf4j-api-2.0.0.jar, then you should
>> also use slf4j-simple-2.0.0.jar, using slf4j-simple-1.5.5.jar will not
>> work."
>>
>> More notes from slf4j FAQ page.
>>
>> "SLF4J 2.0.0 incorporates an optional fluent api
>> . Otherwise, there are no
>> client facing API changes in 2.0.x. For most users, upgrading to version
>> 2.0..x should be a drop-in replacement, as long as the logging provider is
>> updated as well.
>>
>> From the clients perspective, the SLF4J API, more specifically the
>> org.slf4j package, is backward compatible for all versions. This means
>> that you can upgrade from SLF4J version 1.0 to any later version without
>> problems. Code compiled with *slf4j-api-versionN.jar* will work with
>> *slf4j-api-versionM.jar* for any versionN and any versionM. *To date,
>> binary compatibility in slf4j-api has never been broken."*
>>
>>
>> On Mon, Sep 9, 2024 at 9:22 AM Steven Wu  wrote:
>>
>>> Bump the thread to bring the awareness of the issue and implication of
>>> slf4j 2.x upgrade.
>>>
>>> On Mon, Aug 26, 2024 at 12:24 PM Steve Zhang
>>>  wrote:
>>>
 I believe dependabot tried to upgrade slf4j to 2.x in [1] but JB
 mentioned there was a -1 on this upgrade; maybe he has more context.

 [1]https://github.com/apache/iceberg/pull/9688

 Thanks,
 Steve Zhang



 On Aug 24, 2024, at 7:37 PM, Steven Wu  wrote:

 Hi,

 It seems that test logging is broken in many modules (like core, flink)
 because slf4j-api was upgraded to 2.x while slf4j-simple provider is still
 on 1.7. I created a PR that upgraded slf4j-simple testImplementation to 2.x
 for all subprojects.

 https://github.com/apache/iceberg/pull/11001

 That fixed the test logging problem (e.g. TestInMemoryCatalog). You
 can find more details in the PR description. Test logging seems to have
 been broken for a while (since 1.4). But those dep problems have been for
 *test runtime only*.

 Recent change [1] on Avro 1.12.0 introduced slf4j-api 2.x change for
 runtime, as verified by the cmd below on the *main branch*.

 ./gradlew -q :iceberg-core:dependencyInsight --dependency slf4j-api
 --configuration runtimeClasspath

 This thread is to raise awareness on the slf4j-api dep change to 2.x as
 downstream projects/appli

Re: [DISCUSS] Row Lineage Proposal

2024-09-16 Thread Russell Spitzer
One for each table version? Maybe worth thinking about going forward. We
had a little discussion about this at the community sync last Wednesday,
and the general consensus is that we just keep doing things the way we are
doing them until it becomes too unwieldy, then figure out a new solution.
Feel free to start up another thread though; it's worth thinking about.

On Sat, Sep 14, 2024 at 12:43 AM Manu Zhang  wrote:

> Thanks Russell. Not a question on the proposal itself, but I find it a
> bit hard to follow and maintain all three specs in one place. We are also
> publishing an unfinalized spec to the website. Would it be better to
> maintain the spec in a "copy-on-write" style, i.e. each spec having its
> own format file?
>
> Sorry to go off topic, I can start a separate thread if you think this
> concern is valid.
>
>
> On Sat, Sep 14, 2024 at 6:33 AM Russell Spitzer 
> wrote:
>
>> Pull Request Available, please focus any remaining comments there and we
>> can wrap this one up
>>
>> https://github.com/apache/iceberg/pull/11130
>>
>> On Thu, Aug 29, 2024 at 11:20 AM rdb...@gmail.com 
>> wrote:
>>
>>> +1 for making row lineage and equality deletes mutually exclusive.
>>>
>>> The idea behind equality deletes is to avoid needing to read existing
>>> data in order to delete records. That doesn't fit with row lineage because
>>> the purpose of lineage is to be able to identify when a row changes by
>>> maintaining an identifier that would have to be read.
>>>
>>> On Wed, Aug 28, 2024 at 4:16 PM Anton Okolnychyi 
>>> wrote:
>>>
>>>> I went through the proposal and left comments as well. Thanks for
>>>> working on it, Russell!
>>>>
>>>> I don't see a good solution to how row lineage can work with equality
>>>> deletes. If so, I would be in favor of not allowing equality deletes at all
>>>> if row lineage is enabled as opposed to treating all added data records as
>>>> new. I will spend more time thinking if we can make it work.
>>>>
>>>> - Anton
>>>>
>>>> ср, 28 серп. 2024 р. о 12:41 Ryan Blue 
>>>> пише:
>>>>
>>>>> Sounds good to me. Thanks for pushing this forward, Russell!
>>>>>
>>>>> On Tue, Aug 27, 2024 at 7:17 PM Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> I think folks have had a lot of good comments and since there haven't
>>>>>> been a lot of strong opinions I'm going to try to take what I think are 
>>>>>> the
>>>>>> least interesting options and move them into the "discarded section".
>>>>>> Please continue to comment and let's please make sure any things that 
>>>>>> folks
>>>>>> think are blockers for a Spec PR are eliminated. If we have general
>>>>>> consensus at a high level I think we can move to discussing the actual 
>>>>>> spec
>>>>>> changes on a spec change PR.
>>>>>>
>>>>>> I'm going to be keeping the proposals for :
>>>>>>
>>>>>> Global Identifier as the Identifier
>>>>>> and
>>>>>> Last Updated Sequence number as the Version
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Aug 20, 2024 at 3:21 AM Ryan Blue 
>>>>>> wrote:
>>>>>>
>>>>>>> The situation in which you would use equality deletes is when you do
>>>>>>> not want to read the existing table data. That seems at odds with a 
>>>>>>> feature
>>>>>>> like row-level tracking where you want to keep track. To me, it would 
>>>>>>> be a
>>>>>>> reasonable solution to just say that equality deletes can't be used in
>>>>>>> tables where row-level tracking is enabled.
>>>>>>>
>>>>>>> On Mon, Aug 19, 2024 at 11:34 AM Russell Spitzer <
>>>>>>> russell.spit...@gmail.com> wrote:
>>>>>>>
>>>>>>>> As far as I know Flink is actually the only engine we have at the
>>>>>>>> moment that can produce Equally deletes and only Equality deletes have 
>>>>>>>> this
>>>>>>>> specific problem. Since an equality delete can be written without 
>>>>>>>

Re: [DISCUSS] Row Lineage Proposal

2024-09-13 Thread Russell Spitzer
Pull Request Available, please focus any remaining comments there and we
can wrap this one up

https://github.com/apache/iceberg/pull/11130

On Thu, Aug 29, 2024 at 11:20 AM rdb...@gmail.com  wrote:

> +1 for making row lineage and equality deletes mutually exclusive.
>
> The idea behind equality deletes is to avoid needing to read existing data
> in order to delete records. That doesn't fit with row lineage because the
> purpose of lineage is to be able to identify when a row changes by
> maintaining an identifier that would have to be read.
>
> On Wed, Aug 28, 2024 at 4:16 PM Anton Okolnychyi 
> wrote:
>
>> I went through the proposal and left comments as well. Thanks for working
>> on it, Russell!
>>
>> I don't see a good solution to how row lineage can work with equality
>> deletes. If so, I would be in favor of not allowing equality deletes at all
>> if row lineage is enabled as opposed to treating all added data records as
>> new. I will spend more time thinking if we can make it work.
>>
>> - Anton
>>
>> ср, 28 серп. 2024 р. о 12:41 Ryan Blue 
>> пише:
>>
>>> Sounds good to me. Thanks for pushing this forward, Russell!
>>>
>>> On Tue, Aug 27, 2024 at 7:17 PM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
>>>> I think folks have had a lot of good comments and since there haven't
>>>> been a lot of strong opinions I'm going to try to take what I think are the
>>>> least interesting options and move them into the "discarded section".
>>>> Please continue to comment and let's please make sure any things that folks
>>>> think are blockers for a Spec PR are eliminated. If we have general
>>>> consensus at a high level I think we can move to discussing the actual spec
>>>> changes on a spec change PR.
>>>>
>>>> I'm going to be keeping the proposals for :
>>>>
>>>> Global Identifier as the Identifier
>>>> and
>>>> Last Updated Sequence number as the Version
>>>>
>>>>
>>>>
>>>> On Tue, Aug 20, 2024 at 3:21 AM Ryan Blue 
>>>> wrote:
>>>>
>>>>> The situation in which you would use equality deletes is when you do
>>>>> not want to read the existing table data. That seems at odds with a 
>>>>> feature
>>>>> like row-level tracking where you want to keep track. To me, it would be a
>>>>> reasonable solution to just say that equality deletes can't be used in
>>>>> tables where row-level tracking is enabled.
>>>>>
>>>>> On Mon, Aug 19, 2024 at 11:34 AM Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> As far as I know Flink is actually the only engine we have at the
>>>>>> moment that can produce Equality deletes and only Equality deletes have 
>>>>>> this
>>>>>> specific problem. Since an equality delete can be written without 
>>>>>> actually
>>>>>> knowing whether rows are being updated or not, it is always ambiguous as 
>>>>>> to
>>>>>> whether a new row is an updated row, a newly added row, or a row which 
>>>>>> was
>>>>>> deleted but then a newly added row was also appended.
>>>>>>
>>>>>> I think in this case we need to ignore row_versioning and just give
>>>>>> every new row a brand new identifier. For a reader this means all updates
>>>>>> look like a "delete" and "add" and no "updates". For other processes (COW
>>>>>> and Position Deletes) we only mark records as being deleted or updated
>>>>>> after finding them first, this makes it easy to take the lineage 
>>>>>> identifier
>>>>>> from the source record and change it. For Spark, we just kept working on
>>>>>> engine improvements (like SPJ, Dynamic partition pushdown) to try to make
>>>>>> that scan and join faster but we probably still require a bit slower
>>>>>> latency.
>>>>>>
>>>>>> I think we could theoretically resolve equality deletes into updates
>>>>>> at compaction time again but only if the user first defines accurate "row
>>>>>> identity" columns because otherwise we have no way of determining whether
>>>>>> rows were updated 

Re: [DISCUSS] Improving Position Deletes in V3

2024-09-11 Thread Russell Spitzer
+1 I am on board, all of my doubts are resolved
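
For intuition on the shortcomings Anton lists below, especially the gap
between in-memory and on-disk representations: readers today typically
materialize position deletes as an in-memory bitmap per data file, while
the on-disk form is rows of (file_path, pos) pairs. A sketch of the
bitmap-style in-memory form (assuming the org.roaringbitmap library;
illustrative, not the proposal's actual design):

import java.util.HashMap;
import java.util.Map;
import org.roaringbitmap.RoaringBitmap;

// one bitmap of deleted row positions per data file
public class DeleteVectorSketch {
  private final Map<String, RoaringBitmap> vectors = new HashMap<>();

  void delete(String dataFile, int pos) {
    vectors.computeIfAbsent(dataFile, f -> new RoaringBitmap()).add(pos);
  }

  boolean isDeleted(String dataFile, int pos) {
    RoaringBitmap v = vectors.get(dataFile);
    return v != null && v.contains(pos);
  }
}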

On Thu, Sep 5, 2024 at 3:42 AM Jean-Baptiste Onofré  wrote:

> Hi Anton
>
> Sorry for the late reply on this proposal.
> I like it ! It looks good to me (I have a few minor comments).
>
> It would be great to include this spec update in V3.
> Please let me know if I can help on this !
>
> Regards
> JB
>
> On Wed, Aug 21, 2024 at 11:28 PM Anton Okolnychyi 
> wrote:
> >
> > Hey folks,
> >
> > As discussed during the sync, I've been working on a proposal to improve
> the handling of position deletes in V3. It builds on lessons learned from
> deploying the current approach at scale and addresses all unresolved
> questions from past community discussions and proposals around this topic.
> >
> > In particular, the proposal attempts to address the following
> shortcomings we observe today:
> >
> > - Choosing between fewer delete files on disk or targeted deletes.
> > - Dependence on external maintenance for consistent write and read
> performance.
> > - Writing and reading overhead as in-memory and on-disk representations
> differ.
> >
> > Please, take a look at the doc [1] and let me know what you think. Any
> feedback is highly appreciated!
> >
> > - Anton
> >
> > [1] -
> https://docs.google.com/document/d/18Bqhr-vnzFfQk1S4AgRISkA_5_m5m32Nnc2Cw0zn2XM
> >
> >
>


Re: [DISCUSS] September board report

2024-09-11 Thread Russell Spitzer
This looks good to me

On Wed, Sep 11, 2024 at 12:35 AM Steven Wu  wrote:

> > Flink Range distribution for Sinks
>
> It is already included in Ryan's draft
>
> > Flink Source V2 improvements and V1 deprecation to prepare for Flink 2.0
>
> This is still ongoing. There is a blocking issue with FileIOParser on
> HadoopFileIO: https://github.com/apache/iceberg/pull/10926
>
>
> On Tue, Sep 10, 2024 at 10:06 PM Péter Váry 
> wrote:
>
>> Maybe mention some Flink ongoing tasks, improvements:
>> - Flink Range distribution for Sinks
>> - Flink Source V2 improvements and V1 deprecation to prepare for Flink 2.0
>> - Flink Sink V2 implementation to prepare for Flink 2.0
>> - Flink Table Maintenance (ongoing)
>>
>> Thanks for preparing this Ryan!
>> Peter
>>
>>
>> On Tue, Sep 10, 2024, 23:51 Matt Topol  wrote:
>>
>>> There's one additional point to add for the Go implementation, we
>>> implemented file scan planning. It returns the list of file scan tasks
>>> needed for a given table, partitions and filter expression.
>>>
>>> --Matt
>>>
>>> On Tue, Sep 10, 2024, 5:43 PM rdb...@gmail.com  wrote:
>>>
 Hi everyone,

 It’s time for another ASF board report! Here’s my current draft. Please
 reply if you think there is something that I should add or change. Thanks!

 Ryan
 Description:

 Apache Iceberg is a table format for huge analytic datasets that is
 designed
 for high performance and ease of use.
 Project Status:

 Current project status: Ongoing
 Issues for the board: None
 Membership Data:

 Apache Iceberg was founded 2020-05-19 (4 years ago)
 There are currently 31 committers and 21 PMC members in this project.
 The Committer-to-PMC ratio is roughly 4:3.

 Community changes, past quarter:

- Amogh Jahagirdar was added to the PMC on 2024-08-12
- Eduard Tudenhoefner was added to the PMC on 2024-08-12
- Honah J. was added to the PMC on 2024-07-22
- Renjie Liu was added to the PMC on 2024-07-22
- Peter Vary was added to the PMC on 2024-08-12
- Piotr Findeisen was added as committer on 2024-07-24
- Kevin Liu was added as committer on 2024-07-24
- Sung Yun was added as committer on 2024-07-24
- Hao Ding was added as committer on 2024-07-23

 Project Activity:

 Releases:

- Java 1.6.1 was released on 2024-08-28
- Rust 0.3.0 was released on 2024-08-20
- PyIceberg 0.7.1 was released on 2024-08-18
- PyIceberg 0.7.0 was released on 2024-07-30
- Java 1.6.0 was released on 2024-07-23

 Table format:

- Work for v3 is picking up
- Committed timestamp_ns implementation
- Ongoing discussion/proposal for improvements to row-level deletes
- Ongoing discussion/proposal for row-level metadata for change
tracking
- Discussion for adding variant type and where to maintain the spec
(Parquet)
- Making progress on geometry types
- Clarified transform requirements to add transforms as needed (to
support geo)
- Discovered issues affecting new type promotion cases, reduced
scope

 REST protocol specification:

- Added server-side scan planning
- Support for removing partition specs
- Support for endpoint discovery for future additions
- Clarified failure requirements for unknown actions or validations

 Java:

- Added classes for v3 table writes
- Fixed rewrites in tables with 1000+ columns
- Added Kafka Connect runtime bundle
- Support for Flink 1.20
- Added range distribution support in Flink
- Dropped support for Java 8

 PyIceberg:

- Discussed adding a dependency on iceberg-rust for native
extensions
- Write support for time and identity transforms
- Parallelized large writes
- Support for deletes using filter predicates
- Staged table creation for atomic CTAS
- Support manifest merging on write
- Better integration with PyArrow to produce lazy readers from scans
- New API to add existing Parquet files
- Support custom catalogs

 Rust:

- Established subproject pyiceberg_core to support PyIceberg
- Implemented OAuth for catalog REST client
- Added Parquet writer and reader capabilities with support for
data projection.
- Introduced memory catalog and memory file IO support
- Initialized SQL Catalog
- Added support for GCS storage and AWS session tokens
- Implemented concurrent table scans and data file fetching
- Enhanced predicate builders and expression evaluators
- Added support for timestamp columns in row filters

 Go:

- Implemented expressions and expression visitors

 Community Health:

>>>

Re: [DISCUSS] Drop Hive 2 support

2024-09-09 Thread Russell Spitzer
+1

On Mon, Sep 9, 2024 at 7:59 AM Eduard Tudenhöfner 
wrote:

> +1 on deprecating Hive 2 in Iceberg 1.7 and removing it in 1.8
>
> On Mon, Sep 2, 2024 at 8:53 AM Jean-Baptiste Onofré 
> wrote:
>
>> It sounds good.
>>
>> Thanks !
>> Regards
>> JB
>>
>> On Mon, Sep 2, 2024 at 5:06 AM Manu Zhang 
>> wrote:
>> >
>> > Thanks JB. I think the last Hive 2 support will be Iceberg 1.7.x
>> although we don't recommend that users use a deprecated version.
>> >
>> > Regards,
>> > Manu
>> >
>> > On Fri, Aug 30, 2024 at 8:12 PM Jean-Baptiste Onofré 
>> wrote:
>> >>
>> >> Hi Manu
>> >>
>> >> Your plan makes sense to me (deprecated Hive 2 in 1.7 and remove in
>> 1.8).
>> >> Engines still depending to Hive 2 will have to use Iceberg <= 1.6,
>> which is OK.
>> >>
>> >> Regards
>> >> JB
>> >>
>> >> On Tue, Aug 27, 2024 at 2:34 AM Manu Zhang 
>> wrote:
>> >> >
>> >> > Hi all,
>> >> >
>> >> > I'd like to start a discussion on dropping Hive 2 support, which
>> reached EOL three months ago[1]. It's also a prerequisite for migration to
>> Hadoop 3, as shown by Steve's PR[2]. For your reference, I have a draft
>> PR[3] to show the needed changes.
>> >> >
>> >> > Since we've not deprecated Hive 2 support yet, I think the path is
>> to deprecate Hive 2 in 1.7 and drop Hive 2 in 1.8. What do you think?
>> >> >
>> >> > [1] https://hive.apache.org/general/downloads/
>> >> > [2] https://github.com/apache/iceberg/pull/10932
>> >> > [3] https://github.com/apache/iceberg/pull/10996
>> >> >
>> >> > Regards,
>> >> > Manu
>>
>


Re: [DISCUSS] Row Lineage Proposal

2024-08-27 Thread Russell Spitzer
I think folks have had a lot of good comments and since there haven't been
a lot of strong opinions I'm going to try to take what I think are the
least interesting options and move them into the "discarded section".
Please continue to comment and let's please make sure any things that folks
think are blockers for a Spec PR are eliminated. If we have general
consensus at a high level I think we can move to discussing the actual spec
changes on a spec change PR.

I'm going to be keeping the proposals for :

Global Identifier as the Identifier
and
Last Updated Sequence number as the Version



On Tue, Aug 20, 2024 at 3:21 AM Ryan Blue 
wrote:

> The situation in which you would use equality deletes is when you do not
> want to read the existing table data. That seems at odds with a feature
> like row-level tracking where you want to keep track. To me, it would be a
> reasonable solution to just say that equality deletes can't be used in
> tables where row-level tracking is enabled.
>
> On Mon, Aug 19, 2024 at 11:34 AM Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> As far as I know Flink is actually the only engine we have at the moment
>> that can produce Equality deletes and only Equality deletes have this
>> specific problem. Since an equality delete can be written without actually
>> knowing whether rows are being updated or not, it is always ambiguous as to
>> whether a new row is an updated row, a newly added row, or a row which was
>> deleted but then a newly added row was also appended.
>>
>> I think in this case we need to ignore row_versioning and just give every
>> new row a brand new identifier. For a reader this means all updates look
>> like a "delete" and "add" and no "updates". For other processes (COW and
>> Position Deletes) we only mark records as being deleted or updated after
>> finding them first, this makes it easy to take the lineage identifier from
>> the source record and change it. For Spark, we just kept working on engine
>> improvements (like SPJ, Dynamic partition pushdown) to try to make that
>> scan and join faster but we probably still require a bit slower latency.
>>
>> I think we could theoretically resolve equality deletes into updates at
>> compaction time again but only if the user first defines accurate "row
>> identity" columns because otherwise we have no way of determining whether
>> rows were updated or not. This is basically the issue we have now in the
>> CDC procedures. Ideally, I think we need to find a way to have flink locate
>> updated rows at runtime using some better indexing structure or something
>> like that as you suggested.
>>
>> On Sat, Aug 17, 2024 at 1:07 AM Péter Váry 
>> wrote:
>>
>>> Hi Russell,
>>>
>>> As discussed offline, this would be very hard to implement with the
>>> current Flink CDC write strategies. I think this is true for every
> streaming writer.
>>>
>>> For tracking the previous version of the row, the streaming writer would
>>> need to scan the table. It needs to be done for every record to find the
>>> previous version. This could be possible if the data would be stored in a
>>> way which supports fast queries on the primary key, like LSM Tree (see:
>>> Paimon [1]), otherwise it would be prohibitively costly, and unfeasible for
>>> higher loads. So adding a new storage strategy could be one solution.
>>>
>>> Alternatively we might find a way for the compaction to update the
>>> lineage fields. We could provide a way to link the equality deletes to the
>>> new rows which updated them during write, then on compaction we could
>>> update the lineage fields based on this info.
>>>
> Are there any better ideas from Spark streaming which we can adopt?
>>>
>>> Thanks,
>>> Peter
>>>
>>> [1] - https://paimon.apache.org/docs/0.8/
>>>
>>> On Sat, Aug 17, 2024, 01:06 Russell Spitzer 
>>> wrote:
>>>
>>>> Hi Y'all,
>>>>
>>>> We've been working on a new proposal to add Row Lineage to Iceberg in
>>>> the V3 Spec. The general idea is to give every row a unique identifier as
>>>> well as a marker of what version of the row it is. This should let us build
>>>> a variety of features related to CDC, Incremental Processing and Audit
>>>> Logging. If you are interested please check out the linked proposal below.
>>>> This will require compliance from all engines to be really useful so It's
>>>> important we come to consensus on whether or not this is possible.
>>>>
>>>>
>>>> https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit?usp=sharing
>>>>
>>>>
>>>> Thank you for your consideration,
>>>> Russ
>>>>
>>>
>
> --
> Ryan Blue
> Databricks
>


Re: [VOTE] REST Endpoint discovery

2024-08-20 Thread Russell Spitzer
+1

On Tue, Aug 20, 2024 at 2:32 PM Walaa Eldin Moustafa 
wrote:

> +1 non-biding
>
> Thanks for driving this Eduard.
>
> On Tue, Aug 20, 2024 at 12:17 PM Daniel Weeks  wrote:
>
>> +1
>>
>> On Tue, Aug 20, 2024 at 11:19 AM Yufei Gu  wrote:
>>
>>> +1
>>>
>>> Yufei
>>>
>>>
>>> On Tue, Aug 20, 2024 at 11:16 AM Eduard Tudenhöfner <
>>> etudenhoef...@apache.org> wrote:
>>>
 Hey everyone,

 I'd like to vote on PR #10928, which adds a way for
 REST servers to communicate to clients what endpoints they support via a
 new *endpoints* field in the *CatalogConfig* of the *v1/config* endpoint.

 Discuss thread can be found here:
 https://lists.apache.org/thread/8h86382omdx9cmvc15m2bf361p5rz4rk

 The vote will be open for at least 72 hours. Please vote with

 [ ] +1
 [ ] -0
 [ ] -1 do not make these changes because ...

>>>


Re: Community sync

2024-08-20 Thread Russell Spitzer
Copied from Calendar Invite


https://meet.google.com/ujy-njjo-vre
Triweekly Iceberg meeting for anyone wanting to get involved in Iceberg
development or documentation, or to hear about the roadmap! Please remember
that anyone is welcome to join these community syncs, so please feel free
to share the calendar on the Iceberg community page.

A few things worth noting:

   - The guiding agenda will be listed in the attached Iceberg Community
   Sync docs prior to the start of the meeting and everyone is welcome to
   add items there (a reminder will be sent out the prior day)
   - Meetings will be recorded and uploaded to the Iceberg YouTube channel.
   A link will be included in the meeting minutes email sent to the dev list
   as well as in the sync docs
   - Each meeting will start with 5-10 minutes to cover any highlights
   since the last sync. These highlights will be included in the doc as well
   and likewise feel free to add items there, such as welcoming new community
   members! :)
   - If conversations are pinned for further off-line discussions, this
   will be noted in the meeting minutes and should be updated with any
   conclusions/decisions (feel free to just ping the general slack channel
   with an update and I'll make sure it gets updated in the minutes)
   - Meetings will be held every three weeks. Any changes to the schedule
   will include notifying the community dev list.


Besides that, let's all have fun and keep a welcoming environment!

On Tue, Aug 20, 2024 at 11:38 AM Lessard, Steve
 wrote:

> Based on previous emails in this list I got the impression that a
> community sync is coming up, possibly tomorrow morning. Where can I find
> the meeting information so that I may listen in?
>
>
>
> -Steve Lessard, Teradata
>


Re: [DISCUSS] Release source and binary verification

2024-08-20 Thread Russell Spitzer
I think these are reasonable to add; we probably should also verify there
are no binaries of any kind in the release tarball. Sometimes builds
accidentally leak these.

On Tue, Aug 20, 2024 at 8:36 AM Piotr Findeisen 
wrote:

> Hi All,
>
>
> The release verification [1] includes testing release source tarball
> builds and also testing the binaries with downstream projects.
>
> Does it also contain, should it contain or is it a conscious omission of:
>
> 1. verifying the source tarball is what it should be (source matches the
> git repo state)
> 2. verifying the binaries are what should be built from the source
> ("repeatable builds")
>
> Best
> Piotr
>
> [1]
> https://iceberg.apache.org/how-to-release/#validating-a-source-release-candidate
> .
>
>


Re: [VOTE] Spec changes in preparation for v3

2024-08-19 Thread Russell Spitzer
+1 - Feels duplicative to vote here and approve on the PR

On Mon, Aug 19, 2024 at 2:41 PM Ryan Blue  wrote:

> Hi everyone,
>
> I'd like to vote on PR #10948
> , which has some spec
> changes to prepare for v3:
>
> * Add a high-level v3 summary (only changes already in the spec)
> * Clarify the existing v3 requirement for handling specs with unknown
> transforms
> * Reset heading levels that were set to de-clutter the TOC in previous
> site frameworks
>
> This will be open for at least 72 hours.
>
> [ ] +1
> [ ] -0
> [ ] -1 do not make these changes because . . .
>
> --
> Ryan Blue
>


Re: [DISCUSS] Row Lineage Proposal

2024-08-19 Thread Russell Spitzer
As far as I know Flink is actually the only engine we have at the moment
that can produce Equality deletes and only Equality deletes have this
specific problem. Since an equality delete can be written without actually
knowing whether rows are being updated or not, it is always ambiguous as to
whether a new row is an updated row, a newly added row, or a row which was
deleted but then a newly added row was also appended.

I think in this case we need to ignore row_versioning and just give every
new row a brand new identifier. For a reader this means all updates look
like a "delete" and "add" and no "updates". For other processes (COW and
Position Deletes) we only mark records as being deleted or updated after
finding them first, this makes it easy to take the lineage identifier from
the source record and change it. For Spark, we just kept working on engine
improvements (like SPJ, Dynamic partition pushdown) to try to make that
scan and join faster but we probably still require a bit slower latency.

I think we could theoretically resolve equality deletes into updates at
compaction time again but only if the user first defines accurate "row
identity" columns because otherwise we have no way of determining whether
rows were updated or not. This is basically the issue we have now in the
CDC procedures. Ideally, I think we need to find a way to have flink locate
updated rows at runtime using some better indexing structure or something
like that as you suggested.
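
To make the equality-delete gap concrete, here is a sketch of the two
proposed lineage fields and how each write path treats them (an
illustrative model following the proposal's "global identifier" + "last
updated sequence number" framing, not the spec's column definitions):

class RowLineageSketch {
  record RowLineage(long rowId, long lastUpdatedSequenceNumber) {}

  // COW / position deletes: the old row was read first, so its identifier
  // is carried forward and the version bumped to the committing sequence
  static RowLineage update(RowLineage old, long commitSeq) {
    return new RowLineage(old.rowId(), commitSeq);
  }

  // equality deletes: the old row was never read, so nothing can be
  // carried; the change surfaces to readers as a delete plus an add
  static RowLineage equalityUpsert(long freshRowId, long commitSeq) {
    return new RowLineage(freshRowId, commitSeq);
  }
}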

On Sat, Aug 17, 2024 at 1:07 AM Péter Váry 
wrote:

> Hi Russell,
>
> As discussed offline, this would be very hard to implement with the
> current Flink CDC write strategies. I think this is true for every
> streaming writer.
>
> For tracking the previous version of the row, the streaming writer would
> need to scan the table. It needs to be done for every record to find the
> previous version. This could be possible if the data would be stored in a
> way which supports fast queries on the primary key, like LSM Tree (see:
> Paimon [1]), otherwise it would be prohibitively costly, and unfeasible for
> higher loads. So adding a new storage strategy could be one solution.
>
> Alternatively we might find a way for the compaction to update the lineage
> fields. We could provide a way to link the equality deletes to the new rows
> which updated them during write, then on compaction we could update the
> lineage fields based on this info.
>
> Are there any better ideas from Spark streaming which we can adopt?
>
> Thanks,
> Peter
>
> [1] - https://paimon.apache.org/docs/0.8/
>
> On Sat, Aug 17, 2024, 01:06 Russell Spitzer 
> wrote:
>
>> Hi Y'all,
>>
>> We've been working on a new proposal to add Row Lineage to Iceberg in the
>> V3 Spec. The general idea is to give every row a unique identifier as well
>> as a marker of what version of the row it is. This should let us build a
>> variety of features related to CDC, Incremental Processing and Audit
>> Logging. If you are interested please check out the linked proposal below.
>> This will require compliance from all engines to be really useful so It's
>> important we come to consensus on whether or not this is possible.
>>
>>
>> https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit?usp=sharing
>>
>>
>> Thank you for your consideration,
>> Russ
>>
>


[DISCUSS] Row Lineage Proposal

2024-08-16 Thread Russell Spitzer
Hi Y'all,

We've been working on a new proposal to add Row Lineage to Iceberg in the
V3 Spec. The general idea is to give every row a unique identifier as well
as a marker of what version of the row it is. This should let us build a
variety of features related to CDC, Incremental Processing and Audit
Logging. If you are interested please check out the linked proposal below.
This will require compliance from all engines to be really useful so It's
important we come to consensus on whether or not this is possible.

https://docs.google.com/document/d/146YuAnU17prnIhyuvbCtCtVSavyd5N7hKryyVRaFDTE/edit?usp=sharing


Thank you for your consideration,
Russ


Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Russell Spitzer
I support that whole-heartedly. Parquet would be a great neutral location
for the spec.

On Thu, Aug 15, 2024 at 1:17 PM Ryan Blue 
wrote:

> I think it's a good idea to reach out to the Spark community and make sure
> we are in agreement. Up until now I think we've been thinking more
> abstractly about what makes sense but before we make any decision we should
> definitely collaborate with the other communities.
>
> I'd also like to suggest an alternative for where this spec should be
> maintained that would hopefully allow us to avoid copying and maintaining
> multiple places. As we've already discussed, this is not an easy spec to
> find a home for because there are alternative projects that are all
> interested. Since this is a cross-engine type, Spark may not be ideal. At
> the same time, Delta already supports the variant spec so there's a similar
> problem maintaining this in Iceberg.
>
> I think that a reasonable and neutral option is to see if the Parquet
> community would be willing to host the spec and library. That fits with the
> spec because subcolumnarization is written assuming Parquet is the storage.
> It would also be the best place for broad compatibility because anyone
> using Parquet would have a strong motivation to standardize on the same
> encoding.
>
> Initially, I pushed for Iceberg instead of Parquet because we may want to
> have the same variant encoding in ORC, but what made me change my mind is
> that every layer (file format, table format, engine) has that problem and
> I've heard the concern about neutrality raised multiple times while
> discussing this question internally.
>
> I think the Parquet community is the most neutral option available. Would
> anyone else support asking the Spark and Parquet communities to maintain
> the variant spec in Parquet?
>
> Ryan
>
> On Thu, Aug 15, 2024 at 8:34 AM Xuanwo  wrote:
>
>> From the iceberg-rust perspective, it could be extremely challenging to
>> keep track of both the Spark and Iceberg specifications. Having a single
>> source of truth would be much better. I believe this change will also
>> benefit Delta Lake if they implement the same approach. Perhaps we can try
>> contacting them to initiate such a project?
>>
>> On Thu, Aug 15, 2024, at 23:17, Gang Wu wrote:
>>
>> +1 on posting this discussion to dev@spark ML
>>
>> > I don't think there is anything that would stop us from moving to a
>> joint project in the future
>>
>> My concern is that if we don't do this from day 1, we will never ever do
>> this.
>>
>> Best,
>> Gang
>>
>> On Thu, Aug 15, 2024 at 11:08 PM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>> That's fair @Micah, so far all the discussions have been direct and off
>> the dev list. Would you like to make the request on the public Spark Dev
>> list? I would be glad to co-sign, I can also draft up a quick email if you
>> don't have time.
>>
>> On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield 
>> wrote:
>>
>> I agree that it would be beneficial to make a sub-project, the main
>> problem is political and not logistic. I've been asking for movement from
>> other relevant projects for a month and we simply haven't gotten anywhere.
>>
>>
>> I just wanted to double check that these issues were brought directly to
>> the spark community (i.e. a discussion thread on the Spark developer
>> mailing list) and not via backchannels.
>>
>> I'm not sure the outcome would be different and I don't think this should
>> block forking the spec, but we should make sure that the decision is
>> publicly documented within both communities.
>>
>> Thanks,
>> Micah
>>
>> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>>
>> @Gang Wu
>> I agree that it would be beneficial to make a sub-project; the main
>> problem is political and not logistical. I've been asking for movement from
>> other relevant projects for a month and we simply haven't gotten anywhere.
>> I don't think there is anything that would stop us from moving to a joint
>> project in the future and if you know of some way of encouraging that
>> movement from other relevant parties I would be glad to collaborate in
>> doing that. One thing that I don't want to do is have the Iceberg project
>> stay in a holding pattern without any clear roadmap as to how to proceed.
>>
>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu  wrote:
>>
I’m on board with copying the spec into our repository.

Re: [DISCUSS] REST Endpoint discovery

2024-08-15 Thread Russell Spitzer
I'm on board for this proposal. I was in the off-mail chats and I think
this is probably our simplest approach going forward.

On Thu, Aug 15, 2024 at 10:39 AM Dmitri Bourlatchkov
 wrote:

> OpenAPI tools will WARN a lot if Operation IDs overlap. Generated code/html
> may also look odd in case of overlaps.
>
> All-in-all, I think the best practice is to define unique Operation IDs up
> front.
>
> For Iceberg REST API, the yaml file is the API definition, so it should
> not be a problem to ensure that Operation IDs are unique, I guess.
>
> Cheers,
> Dmitri.
>
> On Thu, Aug 15, 2024 at 11:32 AM Eduard Tudenhöfner <
> etudenhoef...@apache.org> wrote:
>
>> Hey Jack,
>>
>> thanks for the feedback. I replied in the doc but I can reiterate my
>> answer here too: the *path* is unique and required, so it feels more
>> appropriate than relying on the *operationId*, which is optional in the
>> OpenAPI spec.
>> Additionally, using the path is more straightforward when we introduce
>> v2 endpoints, whereas you would have to make sure that all *operationIds*
>> are unique across endpoints (and I'm not sure OpenAPI tools actually
>> enforce uniqueness).
>>
>>
>>
>> On Thu, Aug 15, 2024 at 5:20 PM Jack Ye  wrote:
>>
>>> Hi Eduard,
>>>
>>> In general I agree with this proposal, thanks for putting this up! Just
>>> one question (which I also added in the design), what are the thoughts
>>> behind using the "HTTP method + path" pair, vs using the
>>> operationId defined in the OpenAPI?
>>>
>>> The operationId approach definitely looks much cleaner to me, but (1) in
>>> OpenAPI it is not a requirement to define it, and (2) right now there are
>>> some inconsistent operationIds, for example UpdateTable is the operationId,
>>> but CommitTable is used for all request and response models. But these are
>>> all pretty solvable issues because we can enforce operationId to be
>>> required in the Iceberg spec, and fix it to be consistent, assuming nobody
>>> is taking a dependency on these operationIds right now.
>>>
>>> Personally speaking, I am pretty neutral on this topic, but curious what
>>> everyone thinks.
>>>
>>> Best,
>>> Jack Ye
>>>
>>>
>>>
>>> On Wed, Aug 14, 2024 at 9:20 AM Eduard Tudenhöfner <
>>> etudenhoef...@apache.org> wrote:
>>>
 Hey Dmitri,

 this proposal is the result of the community feedback from the
 Capabilities proposal. Ultimately the capabilities turned out to entail
 more complexity than necessary and so this proposal solves the core problem
 while keeping complexity and spec changes to an absolute minimum.

 Eduard

 On Wed, Aug 14, 2024 at 5:15 PM Dmitri Bourlatchkov
  wrote:

> Hi Eduard,
>
> How is this proposal related to the Server Capabilities discussion?
>
> Thanks,
> Dmitri.
>
> On Wed, Aug 14, 2024 at 5:14 AM Eduard Tudenhöfner <
> etudenhoef...@apache.org> wrote:
>
>> Hey everyone,
>>
>> I'd like to propose a way for REST servers to communicate to clients
>> what endpoints it supports via a new *endpoints* field in the
>> *CatalogConfig* of the *v1/config* endpoint.
>>
>> This enables clients to make better decisions and clearly signal that
>> a particular endpoint isn’t supported.
>>
>> I opened #10937  to
>> track the proposal in GH. Please find the proposal doc here
>> 
>>  (estimated
>> read time: 5 minutes). The proposal requires a Spec change, which can be
>> seen in #10928 .
>>
>>
>> Thanks,
>>
>> Eduard
>>
>
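For concreteness, here is a minimal client-side sketch of the discovery idea,
assuming endpoint identifiers of the "<HTTP verb> <path>" shape discussed in
the thread above; the class and method names are illustrative, not part of
the proposal:

    import java.util.List;
    import java.util.Set;

    // Sketch: a client caches the endpoints advertised by GET /v1/config and
    // checks support before issuing a request, failing fast with a clear
    // error instead of interpreting a 404 from the server.
    class SupportedEndpoints {
      private final Set<String> endpoints;

      SupportedEndpoints(List<String> endpointsFromConfig) {
        this.endpoints = Set.copyOf(endpointsFromConfig);
      }

      boolean supports(String verbAndPath) {
        return endpoints.contains(verbAndPath);
      }
    }

    // usage sketch:
    //   SupportedEndpoints eps = new SupportedEndpoints(
    //       List.of("GET /v1/{prefix}/namespaces", "POST /v1/{prefix}/namespaces"));
    //   eps.supports("DELETE /v1/{prefix}/namespaces/{namespace}"); // -> false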


Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Russell Spitzer
That's fair, @Micah; so far all the discussions have been direct and off the
dev list. Would you like to make the request on the public Spark Dev list?
I would be glad to co-sign; I can also draft up a quick email if you don't
have time.

On Thu, Aug 15, 2024 at 10:04 AM Micah Kornfield 
wrote:

> I agree that it would be beneficial to make a sub-project; the main
>> problem is political and not logistical. I've been asking for movement from
>> other relevant projects for a month and we simply haven't gotten anywhere.
>
>
> I just wanted to double check that these issues were brought directly to
> the spark community (i.e. a discussion thread on the Spark developer
> mailing list) and not via backchannels.
>
> I'm not sure the outcome would be different and I don't think this should
> block forking the spec, but we should make sure that the decision is
> publicly documented within both communities.
>
> Thanks,
> Micah
>
> On Thu, Aug 15, 2024 at 7:47 AM Russell Spitzer 
> wrote:
>
>> @Gang Wu
>>
>> I agree that it would be beneficial to make a sub-project; the main
>> problem is political and not logistical. I've been asking for movement from
>> other relevant projects for a month and we simply haven't gotten anywhere.
>> I don't think there is anything that would stop us from moving to a joint
>> project in the future and if you know of some way of encouraging that
>> movement from other relevant parties I would be glad to collaborate in
>> doing that. One thing that I don't want to do is have the Iceberg project
>> stay in a holding pattern without any clear roadmap as to how to proceed.
>>
>> On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu  wrote:
>>
>>> I’m on board with copying the spec into our repository. However, as
>>> we’ve talked about, it’s not just a straightforward copy—there are already
>>> some divergences. Some of them are under discussion. Iceberg is definitely
>>> the best place for these specs. Engines like Trino and Flink can then rely
>>> on the Iceberg specs as a solid foundation.
>>>
>>> Yufei
>>>
>>> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu  wrote:
>>>
>>>> Sorry for chiming in late.
>>>>
>>>> From the discussion in
>>>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I
>>>> don't quite understand why it is logistically complicated to create a
>>>> sub-project to hold the variant spec and impl.
>>>>
>> IMHO, copying the variant type spec into Apache Iceberg has some
>>>> deficiencies:
>>>> - It is a burden to update two repos if there is a variant type spec
>>>> change and will likely result in deviation if some changes do not reach
>>>> agreement from both parties.
>>>> - Implementers are required to keep an eye on both specs (considering
>>>> proprietary engines where both Iceberg and Delta are supported).
>>>> - Putting the spec and impl of variant type in Iceberg repo does lose
>>>> the opportunity for better native support from file formats like Parquet
>>>> and ORC.
>>>>
>>>> I'm not sure if it is possible to create a separate project (e.g.
>>>> apache/variant-type) to make it a single source of truth. We can learn from
>>>> the experience of Apache Arrow. In this fashion, different engines, table
>>>> formats and file formats can follow the same spec and are free to depend on
>>>> the reference implementations from apache/variant-type or implement their
>>>> own.
>>>>
>>>> Best,
>>>> Gang
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye  wrote:
>>>>
>>>>> +1 for copying the spec into our repository, I think we need to own it
>>>>> fully as a part of the table spec, and we can build compatibility through
>>>>> tests.
>>>>>
>>>>> -Jack
>>>>>
>>>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> I'm not really in favor of linking and annotating as that just makes
>>>>>> things more complicated and still is essentially forking just with more
>>>>>> steps. If we just track our annotations / modifications to a single
>>>>>> commit/version then we have the same issue again but now you have to go to
>>>>>> multiple sources to get the actual Spec.

Re: [DISCUSS] Variant Spec Location

2024-08-15 Thread Russell Spitzer
@Gang Wu

I agree that it would be beneficial to make a sub-project, the main problem
is political and not logistic. I've been asking for movement from other
relative projects for a month and we simply haven't gotten anywhere. I
don't think there is anything that would stop us from moving to a joint
project in the future and if you know of some way of encouraging that
movement from other relevant parties I would be glad to collaborate in
doing that. One thing that I don't want to do is have the Iceberg project
stay in a holding pattern without any clear roadmap as to how to proceed.

On Wed, Aug 14, 2024 at 11:12 PM Yufei Gu  wrote:

> I’m on board with copying the spec into our repository. However, as we’ve
> talked about, it’s not just a straightforward copy—there are already some
> divergences. Some of them are under discussion. Iceberg is definitely the
> best place for these specs. Engines like Trino and Flink can then rely on
> the Iceberg specs as a solid foundation.
>
> Yufei
>
> On Wed, Aug 14, 2024 at 7:51 PM Gang Wu  wrote:
>
>> Sorry for chiming in late.
>>
>> From the discussion in
>> https://lists.apache.org/thread/xcyytoypgplfr74klg1z2rgjo6k5b0sq, I
>> don't quite understand why it is logistically complicated to create a
>> sub-project to hold the variant spec and impl.
>>
>> IMHO, copying the variant type spec into Apache Iceberg has some
>> deficiencies:
>> - It is a burden to update two repos if there is a variant type spec
>> change and will likely result in deviation if some changes do not reach
>> agreement from both parties.
>> - Implementers are required to keep an eye on both specs (considering
>> proprietary engines where both Iceberg and Delta are supported).
>> - Putting the spec and impl of variant type in Iceberg repo does lose the
>> opportunity for better native support from file formats like Parquet and
>> ORC.
>>
>> I'm not sure if it is possible to create a separate project (e.g.
>> apache/variant-type) to make it a single source of truth. We can learn from
>> the experience of Apache Arrow. In this fashion, different engines, table
>> formats and file formats can follow the same spec and are free to depend on
>> the reference implementations from apache/variant-type or implement their
>> own.
>>
>> Best,
>> Gang
>>
>>
>>
>>
>> On Thu, Aug 15, 2024 at 10:07 AM Jack Ye  wrote:
>>
>>> +1 for copying the spec into our repository, I think we need to own it
>>> fully as a part of the table spec, and we can build compatibility through
>>> tests.
>>>
>>> -Jack
>>>
>>> On Wed, Aug 14, 2024 at 12:52 PM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
>>>> I'm not really in favor of linking and annotating as that just makes
>>>> things more complicated and still is essentially forking just with more
>>>> steps. If we just track our annotations / modifications to a single
>>>> commit/version then we have the same issue again but now you have to go to
>>>> multiple sources to get the actual Spec. *In addition, our copy
>>>> of the Spec is going to require new types which don't exist in the Spark
>>>> Spec, which necessarily means diverging. *We will need to allocate new
>>>> primitive IDs (as noted in my first email).
>>>>
>>>> The other issue I have is that I don't think the Spark Spec is really going
>>>> through a thorough review process from all members of the Spark community;
>>>> I believe it probably should have gone through the SPIP process but instead
>>>> seems to have been merged without broad community involvement.
>>>>
>>>> The only way to truly avoid diverging is to have only a single copy of
>>>> the spec; in our previous discussions the vast majority of the Apache
>>>> Iceberg community wanted it to exist here.
>>>>
>>>> On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks  wrote:
>>>>
>>>>> I'm really excited about the introduction of variant type to Iceberg,
>>>>> but I want to raise concerns about forking the spec.
>>>>>
>>>>> I feel like preemptively forking would create the situation where we
>>>>> end up diverging because there's little reason to work with both
>>>>> communities to evolve in a way that benefits everyone.
>>>>>
>>>>> I would much rather point to a specific version of the spec and
>>>>> annotate any variance in Iceberg's handling.

Re: [DISCUSS] Variant Spec Location

2024-08-14 Thread Russell Spitzer
I'm not really in favor of linking and annotating as that just makes things
more complicated and still is essentially forking just with more steps. If
we just track our annotations / modifications to a single commit/version
then we have the same issue again but now you have to go to multiple
sources to get the actual Spec. *In addition, our copy of the Spec is
going to require new types which don't exist in the Spark Spec, which
necessarily means diverging. *We will need to allocate new primitive IDs
(as noted in my first email).

The other issue I have is that I don't think the Spark Spec is really going
through a thorough review process from all members of the Spark community;
I believe it probably should have gone through the SPIP process but instead
seems to have been merged without broad community involvement.

The only way to truly avoid diverging is to have only a single copy of the
spec; in our previous discussions the vast majority of the Apache Iceberg
community wanted it to exist here.

On Wed, Aug 14, 2024 at 2:19 PM Daniel Weeks  wrote:

> I'm really excited about the introduction of variant type to Iceberg, but
> I want to raise concerns about forking the spec.
>
> I feel like preemptively forking would create the situation where we end
> up diverging because there's little reason to work with both communities to
> evolve in a way that benefits everyone.
>
> I would much rather point to a specific version of the spec and annotate
> any variance in Iceberg's handling.  This would allow us to continue
> without dividing the communities.
>
> If at any point there are irreconcilable differences, I would support
> forking, but I don't feel like that should be the initial step.
>
> No one is excited about the possibility that the physical representations
> end up diverging, but it feels like we're setting ourselves up for that
> exact scenario.
>
> -Dan
>
>
> On Wed, Aug 14, 2024 at 6:54 AM Fokko Driesprong  wrote:
>
>> +1 to what's already being said here. It is good to copy the spec to
>> Iceberg and add context that's specific to Iceberg, but at the same time,
>> we should maintain compatibility.
>>
>> Kind regards,
>> Fokko
>>
>> On Wed, Aug 14, 2024 at 15:30, Manu Zhang  wrote:
>>
>>> +1 to copy the spec into our repository. I think the best way to keep
>>> compatibility is building integration tests.
>>>
>>> Thanks,
>>> Manu
>>>
>>> On Wed, Aug 14, 2024 at 8:27 PM Péter Váry 
>>> wrote:
>>>
>>>> Thanks Russell and Aihua for pushing Variant support!
>>>>
>>>> Given the differences between the supported types and the lack of
>>>> interest from the other project, I think it is reasonable to duplicate the
>>>> specification to our repository.
>>>> I would put a very strong emphasis on sticking to the Spark spec as much
>>>> as possible, to keep compatibility. Maybe we could even revert
>>>> to a shared specification if the situation changes.
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> Aihua Xu  wrote (on Tue, Aug 13, 2024, at
>>>> 19:52):
>>>>
>>>>> Thanks Russell for bringing this up.
>>>>>
>>>>> This is the main blocker to move forward with the Variant support in
>>>>> Iceberg and hopefully we can reach a consensus. To me, it makes more
>>>>> sense to move the spec into Iceberg, rather than having the Spark engine
>>>>> own it while we try to keep it compatible with the Spark spec.
>>>>>
>>>>> Thanks,
>>>>> Aihua
>>>>>
>>>>> On Mon, Aug 12, 2024 at 6:50 PM Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> Hi Y’all,
>>>>>>
>>>>>> We’ve hit a bit of a roadblock with the Variant Proposal: while we
>>>>>> were hoping to move the Variant and Shredding specifications from Spark
>>>>>> into Iceberg, there doesn’t seem to be a lot of interest in that.
>>>>>> Unfortunately, I think we have a number of issues with just linking to 
>>>>>> the
>>>>>> Spark project directly from within Iceberg and *I believe we need to
>>>>>> copy the specifications into our repository*.
>>>>>>
>>>>>> There are a few reasons why I think this is necessary
>>>>>>
>>>>>> First, we have a divergence of types already. The Spark Specification

Welcome Péter, Amogh and Eduard to the Apache Iceberg PMC

2024-08-13 Thread Russell Spitzer
Hi Y'all,

It is my pleasure to let everyone know that the Iceberg PMC has voted to
have several talented individuals join us.

So without further ado, please welcome Péter Váry, Amogh Jahagirdar and
Eduard Tudenhoefner to the Apache Iceberg PMC.

As usual I am excited about the future of this community and thankful for
the hard work and stewardship of its members.

Thank you for your time,
Russell Spitzer


Re: [DISCUSS] Changing namespace separator in REST spec

2024-08-13 Thread Russell Spitzer
I feel like we are overcomplicating this situation.

It seems like our specification made a poor choice of
separator character; do we have any disagreement on this point? It looks
like by choosing a control character, we ended up generating requests which
modern server systems define as potentially dangerous. Although this
hasn't come up in the wild as far as I know, it makes sense for us
to just change the character to something else, modify
the spec, and move forward.

The ability to define custom characters for namespace separation wasn't an
issue before, and I'm not sure why we want to introduce that capability now
on either the Server or Client side, although the Server side seems less
complicated here. I think escaping is probably fine as well if that can
work, but I'm not sure the Servlet spec allows that. Any escaping seems
like it will also require client and server changes similar to just changing
the character, so it seems like similar developer complexity there.

Overall I really want to just emphasize that I don't think this is a place
where users either on the Server or Client side should have to think about
configuration.
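To make the problem concrete, a small sketch of what the current spec's
separator looks like on the wire (plain Java; the namespace values are
illustrative):

    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;

    public class SeparatorDemo {
      public static void main(String[] args) {
        // flatten a multipart namespace with the unit separator (0x1F),
        // as the current REST spec requires
        String multipart = String.join("\u001F", "accounting", "tax", "2024");
        String encoded = URLEncoder.encode(multipart, StandardCharsets.UTF_8);
        // prints accounting%1Ftax%1F2024 -- the %1F control character is
        // what stricter servlet containers (e.g. Jetty 12) flag as suspicious
        System.out.println(encoded);
      }
    }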

On Tue, Aug 13, 2024 at 4:02 AM Robert Stupp  wrote:

> I do not see why "old clients are already broken" - they behave according
> to the REST spec.
>
> The current REST spec states "If parent is a multipart namespace, the
> parts must be separated by the unit separator (`0x1F`) byte" [1]. Using a
> different separator, even if it is "backwards compatible", is a change to
> the Iceberg REST spec.
>
> If _one_ service is subject to new(er) restrictions, it is IMO that
> service that cannot follow the REST spec. Jetty 12 (Servlet Spec v6) was
> mentioned as a justification. However, Jetty 12 and other implementations
> allow operators to disable the "suspicious character validation".
>
> The proposed change to the REST spec and implementation would only address
> the problem for that service, but not for all services. PMC members said on
> the "Iceberg Catalog Sync" call on Aug 7 that they want this change for
> that service implementation and defer a solution that works for all
> implementations, explicitly mentioning "Nessie", for quite some time.
>
> The "root cause" is that neither the Iceberg REST spec nor the "reference
> implementation" poses any restriction on the the allowed set of characters
> in namespace elements, which is therefore also part of the Iceberg REST
> spec / "reference implementation".
>
> That said, the proposed change does not work for all implementations.
> Keeping the Iceberg REST spec as is and configuring the (affected) services
> so that they accept the '%1F' separator character seems much easier and less
> controversial and eliminates the need for these changes entirely.
>
> An alternative, using an appropriate escaping mechanism, was turned down
> on the same "Iceberg Catalog Sync" call.
>
>
> Robert
>
>
> [1]
> https://github.com/apache/iceberg/blob/bfab2c334e9b4c11de65f1f9bd1de5dab18aae5b/open-api/rest-catalog-open-api.yaml#L225
>
>
> On 06.08.24 21:24, Yufei Gu wrote:
>
> Thanks Ryan and Daniel for the context. I'm OK that the server provides a
> separator via config endpoint. Just want to dig a bit more, it does provide
> more flexibility for the separator, but why do we need this flexibility?
> Looks like the only benefit is to reconstruct the namespaces. What's the
> use case of reconstruction? We may talk about it in tomorrow's meeting.
>
> Thanks Eduard for the PR.
>
>
> Yufei
>
>
> On Tue, Aug 6, 2024 at 3:10 AM Eduard Tudenhöfner <
> etudenhoef...@apache.org> wrote:
>
>> It's probably best to make this configurable rather than changing the
>> separator in the spec. The server can communicate to the client which
>> namespace separator should be used (via a config override), but still needs
>> to support *%1F* as a fallback. I've opened #10877
>>  to address this.
>>
>> @Robert: Old clients are already broken with *%1F* so we won't be able
>> to avoid an old client failing with a 4xx. Introducing a new endpoint that
>> takes a different namespace separator won't fix the issue for old clients
>> either, since those clients won't know anything about that endpoint.
>>
>>
>> On Mon, Aug 5, 2024 at 7:07 PM Daniel Weeks  wrote:
>>
>>> I would agree with adding either a server side (config override) or
>>> client side control (query param with `?delim=.`) as it will be
>>> compatible with the current v1 endpoint.
>>>
>>> In the future we could introduce a v2 endpoint(s), but I would want to
>>> wait for OpenAPI 4 because they address this by allowing multi-segment
>>> pathing via URI templates in RFC 6570
>>> , which is the original
>>> way we wanted to represent namespaces, but it wasn't supported (e.g.
>>> .../{+namespaces}/tables/{table}).  I doubt it's really worth the
>>> effort though, so I feel like a configurable

[DISCUSS] Variant Spec Location

2024-08-12 Thread Russell Spitzer
Hi Y’all,

We’ve hit a bit of a roadblock with the Variant Proposal: while we were
hoping to move the Variant and Shredding specifications from Spark into
Iceberg, there doesn’t seem to be a lot of interest in that. Unfortunately,
I think we have a number of issues with just linking to the Spark project
directly from within Iceberg and *I believe we need to copy the
specifications into our repository*.

There are a few reasons why I think this is necessary

First, we have a divergence of types already. The Spark Specification
already includes types which Iceberg has no definition for (19, 20 -
Interval Types), and Iceberg already has a type which is not included
within the Spark Specification (Time) and will soon have more with
TimestampNS, and Geo.

Second, We would like to make sure that Spark is not a hard dependency for
other engines. We are working with several implementers of the Iceberg spec
and it has previously been agreed that it would be best if the source of
truth for Variant existed in an engine and file format neutral location.
The Iceberg project has a good open model of governance and, as we have
seen so far discussing Variant, open
and active collaboration. This would also help as we can strictly version
our changes in-line with the rest of the Iceberg spec.

Third, The Shredding spec is not quite finished and requires some group
analysis and discussion before we commit it. I think again the Iceberg
community is probably the right place for this to happen as we have already
started discussions here on these topics.

For these reasons I think we should go with a direct copy of the existing
specification from the Spark Project and move ahead with our discussions
and modifications within Iceberg. That said, *I do not want to diverge if
possible from the Spark proposal*. For example, although we do not use the
Interval types above, I think we should not reuse those type ids within our
spec. Iceberg's Variant Spec types 19 and 20 would remain unused along with
any other types we think are not applicable. We should strive whenever
possible to allow for compatibility.

In the interest of moving forward with this proposal, I am hoping to see if
anyone in the community objects to this plan or has a better
alternative.

As always I am thankful for your time and am eager to hear back from
everyone,
Russ
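To illustrate the id-reservation point, here is a sketch with hypothetical
constants (not an actual spec table): Spark's ids stay stable in the Iceberg
copy, the two interval ids remain reserved and unused, and Iceberg-only
types take fresh ids:

    // hypothetical constants for illustration only
    final class VariantPrimitiveIds {
      static final int SPARK_INTERVAL_A = 19;     // reserved: Spark interval type, unused by Iceberg
      static final int SPARK_INTERVAL_B = 20;     // reserved: Spark interval type, unused by Iceberg
      static final int ICEBERG_TIME = 21;         // hypothetical id for an Iceberg-only type
      static final int ICEBERG_TIMESTAMP_NS = 22; // hypothetical

      private VariantPrimitiveIds() {}
    }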


Re: [VOTE] Clarify "File System Tables" in the table spec

2024-08-01 Thread Russell Spitzer
+1 (Binding)

On Thu, Aug 1, 2024 at 7:31 AM Fokko Driesprong  wrote:

> +1 (binding)
>
> On Thu, Aug 1, 2024 at 09:57, Eduard Tudenhöfner <
> etudenhoef...@apache.org> wrote:
>
>> +1 (non-binding)
>>
>> On Thu, Aug 1, 2024 at 6:52 AM Micah Kornfield 
>> wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Wed, Jul 31, 2024 at 5:12 PM Ryan Blue  wrote:
>>>
 As promised in the discussion thread, I've opened a PR to clarify the
 "File System Tables" section and mark it deprecated since there appears to
 be consensus for at least warning people that this is unsafe in most cases
 and discouraged.

 The PR is here: https://github.com/apache/iceberg/pull/10833

 Please vote on this spec change. This will be open for at least 72
 hours:
 [] +1
 [] +0
 [] -1, do not deprecate File System Tables because . . .

 --
 Ryan Blue

>>>


Re: [DISCUSS] adoption of format version 3

2024-07-31 Thread Russell Spitzer
I think this all sounds good; the real question is whether or not we have
someone to actively work on the proposals. I think for things like Default
Values and Geo Types we have folks actively working on them so it's not a
big deal.

On Wed, Jul 31, 2024 at 2:09 PM Szehon Ho  wrote:

> Sorry I missed the sync this morning (sick); I'd like to push for geo too.
>
> I think on this front as per the last sync, Ryan recommended to wait for
> Parquet support to land, to avoid having two versions on Iceberg side
> (Iceberg-native vs Parquet-native).  Parquet support is being actively
> worked on iiuc: https://github.com/apache/parquet-format/pull/240 .  But
> it would bind V3 to the parquet-format release timeline, unless we start
> with iceberg-native support first and move later (as we originally
> proposed).
>
> Thanks,
> Szehon
>
> On Wed, Jul 31, 2024 at 10:58 AM Walaa Eldin Moustafa <
> wa.moust...@gmail.com> wrote:
>
>> Another feature that was planned for V3 is support for default values.
>> Spec doc update was already merged a while ago [1]. Implementation is
>> ongoing in this PR [2].
>>
>> [1] https://iceberg.apache.org/spec/#default-values
>> [2] https://github.com/apache/iceberg/pull/9502
>>
>> Thanks,
>> Walaa.
>>
>> On Wed, Jul 31, 2024 at 10:52 AM Russell Spitzer
>>  wrote:
>> >
>> > Thanks for bringing this up, I would say that from my perspective I
>> have time to really push through hopefully two things
>> >
>> > Variant Type and
>> > Row Lineage (which I will have a proposal for on the mailing list next
>> week)
>> >
>> > I'm using the Project to try to track logistics and minutiae required
>> for the new spec version but I would like to bring other work in there as
>> well so we can get a clear picture of what is actually being actively
>> worked on.
>> >
>> > On Wed, Jul 31, 2024 at 12:27 PM Jacob Marble <
>> jacobmar...@influxdata.com> wrote:
>> >>
>> >> Good morning,
>> >>
>> >> To continue the community sync today when format version 3 was
>> discussed.
>> >>
>> >> Questions answered by consensus:
>> >> - Format version releases should _not_ be tied to Iceberg version
>> releases.
>> >> - Several planned features will require format version releases; the
>> process shouldn't be onerous.
>> >>
>> >> Unanswered questions:
>> >> - What will be included in format version 3?
>> >>   - What is a reasonable target date?
>> >>   - How to track progress? Today, there are two public lists:
>> >> - GH milestone: https://github.com/apache/iceberg/milestone/42
>> >> - GH project: https://github.com/orgs/apache/projects/377
>> >> - What is required of a feature in order to be included in any adopted
>> format version?
>> >>   - At least one complete reference implementation should exist.
>> >> - Java is the reference implementation by convention; that's OK,
>> but not perfect. Should Java be the reference implementation by mandate?
>> >>
>> >> Have I missed anything?
>> >>
>> >> --
>> >> Jacob Marble
>>
>


Re: [DISCUSS] adoption of format version 3

2024-07-31 Thread Russell Spitzer
Thanks for bringing this up. I would say that from my perspective I have
time to really push through hopefully two things:

Variant Type and
Row Lineage (which I will have a proposal for on the mailing list next week)

I'm using the Project to try to track logistics and minutiae required for
the new spec version but I would like to bring other work in there as well
so we can get a clear picture of what is actually being actively worked on.

On Wed, Jul 31, 2024 at 12:27 PM Jacob Marble 
wrote:

> Good morning,
>
> To continue the discussion from today's community sync, where format version 3 was discussed.
>
> Questions answered by consensus:
> - Format version releases should _not_ be tied to Iceberg version releases.
> - Several planned features will require format version releases; the
> process shouldn't be onerous.
>
> Unanswered questions:
> - What will be included in format version 3?
>   - What is a reasonable target date?
>   - How to track progress? Today, there are two public lists:
> - GH milestone: https://github.com/apache/iceberg/milestone/42
> - GH project: https://github.com/orgs/apache/projects/377
> - What is required of a feature in order to be included in any adopted
> format version?
>   - At least one complete reference implementation should exist.
> - Java is the reference implementation by convention; that's OK, but
> not perfect. Should Java be the reference implementation by mandate?
>
> Have I missed anything?
>
> --
> Jacob Marble
>


Re: [DISCUSS] Deprecate HadoopTableOperations, move to tests in 2.0

2024-07-31 Thread Russell Spitzer
My guess would be to avoid complications with multiple committers
attempting to swap at the same time.

On Wed, Jul 31, 2024 at 9:50 AM Jack Ye  wrote:

> I see, thank you Fokko, this is very helpful context.
>
> Looking at the PR and the discussions in it, it seems like
> the version hint file is the key problem here. The file system table spec
> [1] is technically correct and only uses a single rename operation to
> perform the atomic commit, and defines that the v.metadata.json is
> the latest file. However the additional write of a version hint file seems
> problematic as that could have additional failures and cause all sorts of
> edge case behaviors, and is not really strictly following the spec.
>
> I agree that if we want to properly follow the current file system table
> spec, then the right way is to stop the commit process after renaming to
> the v<N>.metadata.json, and the reader should perform a listing to
> discover the latest metadata file. If we go with that, this is
> interestingly becoming highly similar to the Delta Lake protocol, where the
> zero-padded log files [2] are discovered using this mechanism [3] I
> believe. And they have implementations for different storage systems
> including HDFS, S3, Azure, GCS, with a pluggable extension point.
>
> One question I have now: what is the motivation in the file system table
> spec to rename the latest table metadata to v<N>.metadata.json,
> rather than just a fixed name like latest.metadata.json? Why is the version
> number in the file name important?
>
> -Jack
>
> [1] https://iceberg.apache.org/spec/#file-system-tables
> [2]
> https://github.com/delta-io/delta/blob/master/PROTOCOL.md#delta-log-entries
> [3]
> https://github.com/delta-io/delta/blob/master/storage/src/main/java/io/delta/storage/LogStore.java#L116
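As a reading aid, a rough sketch of the rename-based commit that the
file-system table spec describes; the names and helpers are illustrative
and this is not the actual HadoopTableOperations code:

    import java.util.UUID;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    class FileSystemCommitSketch {
      // write a temp metadata file, then atomically rename it to the next
      // v<N>.metadata.json; the rename is the single atomic step, and losing
      // a race surfaces as a failed rename because the destination exists
      void commit(FileSystem fs, Path metadataDir, int currentVersion, byte[] newMetadata)
          throws Exception {
        Path tmp = new Path(metadataDir, UUID.randomUUID() + ".metadata.json");
        try (FSDataOutputStream out = fs.create(tmp, false /* overwrite */)) {
          out.write(newMetadata);
        }
        Path next = new Path(metadataDir, "v" + (currentVersion + 1) + ".metadata.json");
        if (!fs.rename(tmp, next)) {
          fs.delete(tmp, false);
          throw new IllegalStateException(
              "Commit failed: v" + (currentVersion + 1) + ".metadata.json already exists");
        }
        // the version hint file discussed in this thread is a separate,
        // non-atomic write, which is exactly where the edge cases come from
      }
    }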
>
>
>
> On Tue, Jul 30, 2024 at 10:52 PM Fokko Driesprong 
> wrote:
>
>> Jack,
>>
>> no atomic drop table support: this seems pretty fixable, as you can
>>> change the semantics of dropping a table to be deleting the latest table
>>> version hint file, instead of having to delete everything in the folder. I
>>> feel that actually also fits the semantics of purge/no-purge better.
>>
>>
>> I would invite you to check out lisoda's PR
>>  (#9546
>>  is an earlier version with
>> more discussion) that works towards removing the version hint file to avoid
>> discrepancies between the latest committed metadata and the metadata that's
>> referenced in the hint file. These can go out of sync since the operation
>> there is not atomic. Removing this introduces other problems where you have
>> to determine the latest version of the metadata using prefix-listing, which
>> is neither efficient nor desirable on an object store, as you already mentioned.
>>
>> Kind regards,
>> Fokko
>>
>> On Wed, Jul 31, 2024 at 04:39, Jack Ye  wrote:
>>
>>> Atomicity is just one requirement, and it also needs to be efficient,
>>> desirably a metadata-only operation.
>>>
>>> Looking at some documentation of GCS [1], the rename operation is still
>>> a COPY + DELETE behind the scenes unless it is a hierarchical namespace
>>> bucket. The Azure documentation [2] also suggests that the fast rename
>>> feature is only available with hierarchical namespace that is for the Gen2
>>> buckets. I found little documentation about the exact rename guarantee and
>>> semantics of ADLS though. But it is undeniable that at least GCS and Azure
>>> should be able to work with HadoopCatalog pretty well with their latest
>>> offerings.
>>>
>>> Steve, if you could share more insights into this and related
>>> documentation, that would be really appreciated.
>>>
>>> -Jack
>>>
>>> [1] https://cloud.google.com/storage/docs/rename-hns-folders
>>> [2]
>>> https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace#the-benefits-of-a-hierarchical-namespace
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jul 30, 2024 at 11:11 AM Steve Loughran
>>>  wrote:
>>>


 On Thu, 18 Jul 2024 at 00:02, Ryan Blue  wrote:

> Hey everyone,
>
> There has been some recent discussion about improving
> HadoopTableOperations and the catalog based on those tables, but we've
> discouraged using file system only table (or "hadoop" tables) for years 
> now
> because of major problems:
> * It is only safe to use hadoop tables with HDFS; most local file
> systems, S3, and other common object stores are unsafe
>

 Azure storage and linux local filesystems all support atomic file and
 dir rename and delete; google gcs does it for files and dirs only. Windows,
 well, anybody who claims to understand the semantics of MoveFile is
 probably wrong (
 https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-movefilewithprogressw
 )

 * Despite not providing atomicity guarantees outside of HDFS, people
> use

Re: [VOTE] Drop Java 8 support in Iceberg 1.7.0

2024-07-26 Thread Russell Spitzer
+1 (binding)

On Fri, Jul 26, 2024 at 8:34 AM Péter Váry 
wrote:

> +1 (non-binding)
>
> Ajantha Bhat  wrote (on Fri, Jul 26, 2024, at
> 14:51):
>
>> +1
>>
>> On Fri, Jul 26, 2024 at 5:16 PM Eduard Tudenhöfner <
>> etudenhoef...@apache.org> wrote:
>>
>>> +1 (non-binding) for dropping JDK8 support with Iceberg 1.7.0
>>>
>>> On Fri, Jul 26, 2024 at 1:29 PM Piotr Findeisen <
>>> piotr.findei...@gmail.com> wrote:
>>>
 Hi,

 Dropping support for building and running on Java 8 was discussed
 previously on "Dropping JDK 8 support" and "Building with JDK 21" mail
 threads.

 As JB kindly pointed out, for a vote we need a "VOTE" thread, so here
 we go.
 Question: Should we drop Java 8 support in Iceberg 1.7.0?

 Best,
 Piotr

 PS
 +1 (non-binding) from me

>>>


Re: [DISCUSS] Spec clarifications on reading/writing Identity partitioned columns

2024-07-25 Thread Russell Spitzer
I have no problem with explicitly stating that writing identity source
columns is optional. We should, of course, mandate surfacing the
column on read :)

On Thu, Jul 25, 2024 at 1:30 PM Micah Kornfield 
wrote:

> The Table specification doesn't say anything about whether writing
> identity partitioned columns is required.  Empirically, it
> appears that implementations always write the column data at least for
> parquet.  For columnar formats, this is relatively cheap as it is trivially
> RLE encodable.  For Avro though it comes at a little bit of a higher cost.
> Since the data is fully reproducible from Iceberg metadata, I think stating
> this as optional in the specification would be useful.
>
> For reading identity partitioned columns from Iceberg tables, I think the
> specification needs to require that identity partition column values are
> read from metadata.  This is because Iceberg supports migrating Hive data
> (and other table formats), which typically doesn't write partition
> information directly into files, without rewriting the data.
>
> Thoughts?
>
> When we get consensus I'll open up a PR to clarify these points.
>
> Thanks,
> Micah
>
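To make the reading side concrete, a sketch (illustrative, not the reference
implementation): an identity partition value is constant for every row in a
file, so a reader can fill the column from the file's partition tuple in the
manifest instead of decoding it from the data file:

    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.StructLike;

    class IdentityConstantSketch {
      // posInPartition is the identity field's position in the partition
      // struct of the file's spec; error handling is omitted
      Object identityValue(DataFile file, int posInPartition) {
        StructLike partition = file.partition();
        return partition.get(posInPartition, Object.class);
      }
    }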


Re: Dropping JDK 8 support

2024-07-23 Thread Russell Spitzer
+1

On Tue, Jul 23, 2024 at 11:47 AM huaxin gao  wrote:

> We don't have to drop JDK11 support. In spark-ci, I can change the matrix
> to only run Java 17 for Spark 4.0, but in java-ci, we might not be able to
> build java docs and do build checks for JDK 11.
>
> Huaxin
>
>
>
> On Tue, Jul 23, 2024 at 9:32 AM Jack Ye  wrote:
>
>> Does that mean we also need to drop JDK11 support?
>>
>> I think there should be a way to configure CI to only run JDK17 for Spark
>> 4.0:
>>
>>
>> https://github.com/apache/iceberg/blob/main/.github/workflows/spark-ci.yml#L74
>>
>> https://stackoverflow.com/questions/68994484/how-to-skip-a-configuration-of-a-matrix-with-github-actions
>>
>> -Jack
>>
>> On Tue, Jul 23, 2024 at 9:15 AM huaxin gao 
>> wrote:
>>
>>> It seems my earlier question might have been overlooked. Could we
>>> clarify if JDK 8 support is being dropped in the next version? The proposal
>>> indicated the Iceberg 2.0 release, but it's unclear if that's our next
>>> version. Given that I'm working on Spark 4.0 support
>>> , which only runs on Java
>>> 17 and 21, dropping JDK 8 in the next release could help us maintain
>>> consistency with Spark's Java requirements.
>>>
>>> Thanks 🙂
>>>
>>> On Tue, Jul 23, 2024 at 8:21 AM Ryan Blue 
>>> wrote:
>>>
 +1

 On Mon, Jul 22, 2024 at 10:49 PM Péter Váry <
 peter.vary.apa...@gmail.com> wrote:

> +1 (non-binding)
>
> On Tue, Jul 23, 2024, 07:15 Ajantha Bhat 
> wrote:
>
>> +1 (non-binding)
>>
>> On Tue, Jul 23, 2024 at 9:54 AM Yufei Gu 
>> wrote:
>>
>>> Hi Manu,
>>>
>>> If JDK 8 support is dropped in 2.0, will we continue to fix critical
 issues in 1.6+?

>>> Nothing stops people from cutting a release, and it becomes an
>>> official release once it is approved. Here is the Apache Release Policy 
>>> for
>>> reference, https://www.apache.org/legal/release-policy.html.
>>>
>>> Yufei
>>>
>>>
>>> On Mon, Jul 22, 2024 at 7:22 PM Renjie Liu 
>>> wrote:
>>>
 +1 (non-binding)

 On Tue, Jul 23, 2024 at 9:40 AM Szehon Ho 
 wrote:

> +1 for dropping JDK 8 in Iceberg 2.0.  I also wonder the same
> thing as Huaxin (sorry if I missed a previous thread on Iceberg 2.0 
> plan).
>
> Also as Huaxin has discovered in Spark 4.0 Support PR
> , looks like we may
> have to drop Java8 first in Spark 4.0 module, due to it being dropped 
> in
> Spark 4.0, before Iceberg 2.0.
>
> Thanks
> Szehon
>
> On Mon, Jul 22, 2024 at 6:31 PM huaxin gao 
> wrote:
>
>> +1 (non-binding)
>>
>> I have a question about iceberg versioning. After the 1.6
>> release, will there be versions 1.7, 1.8 and 1.9, or will it go 
>> straight to
>> 2.0?
>>
>> On Mon, Jul 22, 2024 at 5:32 PM Manu Zhang <
>> owenzhang1...@gmail.com> wrote:
>>
>>> If JDK 8 support is dropped in 2.0, will we continue to fix
>>> critical issues in 1.6+?
>>>
>>> On Tue, Jul 23, 2024 at 1:35 AM Jack Ye 
>>> wrote:
>>>
 +1 (binding), I did not expect this to be a vote thread, but
 overall +1 for dropping JDK8 support.

 -Jack

 On Mon, Jul 22, 2024 at 10:30 AM Yufei Gu 
 wrote:

> +1(binding), as much as I want to drop JDK 8, I still encourage
> everyone to speak out about any concerns.
> Yufei
>
>
> On Mon, Jul 22, 2024 at 10:24 AM Steven Wu <
> stevenz...@gmail.com> wrote:
>
>> +1 (binding)
>>
>> On Mon, Jul 22, 2024 at 6:37 AM Piotr Findeisen <
>> piotr.findei...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> in the "Building with JDK 21" email thread we discussed
>>> adding JDK 21 support and also dropping JDK 8 support, as these 
>>> things were
>>> initially related.
>>> A lot of people expressed acceptance for dropping JDK 8
>>> support, and release 2.0 was proposed as a timeline.
>>> There were also concerned raised, as some people still use
>>> JDK 8.
>>>
>>> Let me start this new thread for a discussion and perhaps
>>> formal vote for dropping JDK 8 support in Iceberg 2.0 release.
>>>
>>> Best
>>> Piotr
>>>
>>>

 --
 Ryan Blue
 Databricks

>>>


Re: [DISCUSS][BYLAWS] Moving forward on the bylaws

2024-07-23 Thread Russell Spitzer
Micah has a great list there that captures my view. I'm similarly not as interested in the
bureaucracy of the project and more interested in actually discussing how
we operate from a technical perspective as the community grows.

On Tue, Jul 23, 2024 at 1:01 AM Micah Kornfield 
wrote:

> My 2 cents on this topic. I think we are getting bogged down in relatively
> minor details/bureaucratic points. This is a reiteration of a previous
> recommendation on the topic, but in the interest of making progress here,
> I'd propose we break this conversation down and focus on incremental
> definitions/proposals.
>
> I've added a potential breakdown below. It is roughly ordered, starting from
> codifying day-to-day business and extending to longer-term concerns.  I
> would aim for minimalism on each of these and add additional guidance when
> we need it.  It is important to keep in mind that most decisions made on a
> day to day basis are fairly easily reversible and more process/formalism
> can always be added if some part of the process has bugs in it.
>
> 1.  Requirements to merge a PR (1 committer approval + X amount of time to
> allow others to review?)
> 2.  Requirements major projects/features including spec changes
> 3.  Define roles/responsibilities and provide guidance on how members can
> grow into those roles.  (Cover release manager, committer, PMC member,
> community member, with links as appropriate).
> 4.  Create guidelines on starting a new revision of the specification (i.e.
> when will the table V3 spec be considered closed?).
> 5.  People issues (if the community thinks this is still necessary).
>
> Thanks,
> Micah
>
> On Mon, Jul 22, 2024 at 10:05 AM Jack Ye  wrote:
>
>> Just to follow up on the other topics, here are my comments, mainly to
>> reconcile with what have been discussed in different threads, which could
>> help formulating these multiple-choice questions:
>>
>> > What should the minimum voting period be?
>>
>> Do we decide a minimum voting period for all topics, or do we decide a
>> different period for each topic? So far it seems like the ASF convention
>> has different minimum periods defined, like 3 days for release [1], 7 days
>> for new committers/PMC members [2].
>>
>> > I'd like to include a couple sentences about the different hats at
>> Apache and that votes should be for the benefit of the project and not our
>> employers.
>> > I'd like to propose that we include text to formally include censor and
>> potential removal for disclosing sensitive information from the private
>> list.
>>
>> Personally I am definitely +1 for this, but going back to the topic of
>> Iceberg vs ASF bylaws/guidelines, I feel ideally this should also be a part
>> of ASF general bylaws/guidelines that Iceberg simply just references.
>>
>> > Requirements for each topic
>>
>> I think this is also missing code modification and the improvement
>> proposal votes. My impression so far after discussions in the initial
>> bylaws doc [3] is that ASF definition of "code modification" is closer to a
>> full design process and uses consensus approval. But in reality "merging
>> PR" is a much more lightweight "vote" that typically requires just 1
>> committer approval of the PR. On the other hand, the Iceberg improvement
>> proposal process is currently described to use the code modification
>> consensus approval vote [4], but there are other options like 2/3 majority
>> and majority vote that were proposed.
>>
>> > consensus, lazy consensus, lazy majority, lazy 2/3's
>>
>> In the initial bylaws doc [3] it was pointed out that the definition of
>> "lazy consensus" is different in different documents. In general the
>> definition is "no -1", but there is also the definition [5] of "at least
>> 1 +1, no -1". I ended up giving it a different name of "minimum consensus",
>> and actually this has been the model used for merging pull requests. We
>> might want to clarify that part first before voting for these options.
>>
>> [1] https://www.apache.org/legal/release-policy.html#release-approval
>> [2] https://community.apache.org/newcommitter.html#discussion
>> [3]
>> https://docs.google.com/document/d/1S3igb5NqSlYE3dq_qRsP3X2gwhe54fx-Sxq5hqyOe6I/edit
>> [4] https://iceberg.apache.org/contribute/#how-are-proposals-adopted
>> [5] https://orc.apache.org/develop/bylaws/
>>
>> -Jack
>>
>> On Fri, Jul 19, 2024 at 2:29 PM Jack Ye  wrote:
>>
>>> > specifically the discussion of the standard roles
>>>
>>> Yes, there are also other places with different definitions. For example
>>> the default project guideline [1] has additional description of the PMC
>>> member and chair responsibilities. There are a few other places like ASF
>>> glossary [2] where these terms are defined, I cannot recall on the top of
>>> my head, but I can try to dig those up later.
>>>
>>> > putting together a committer requirements/guidelines doc
>>>
>>> +1. For some context, the committer guideline discussion came from both
>>> some initial feedback on devlist, as well as comments

Re: [ANNOUNCE] Welcoming new committers and PMC members

2024-07-23 Thread Russell Spitzer
"so many" :)

On Tue, Jul 23, 2024 at 9:14 AM Russell Spitzer 
wrote:

> This is truly an exciting day. To have to many qualified folks being
> recognized by the Iceberg project fills me with pride. I can't wait to see
> what we get done together!
>
> On Tue, Jul 23, 2024 at 9:12 AM Sung Yun  wrote:
>
>> Thank you very much!
>>
>> I am excited to see the project growing to new capacities as well, and to
>> be an active part of that journey.
>>
>> I will continue to work hard together with the community to take
>> (Py)Iceberg to its next steps.
>>
>> Sung
>>
>>
>> On Jul 23, 2024, at 10:01 AM, Matt Topol  wrote:
>>
>> 
>>
>> Congrats!!
>>
>> On Tue, Jul 23, 2024, 9:52 AM Xuanwo  wrote:
>>
>>> Thank you so much for the recognition! I'm thrilled to join as an
>>> Iceberg Committer.
>>>
>>> I'm currently working hard on the iceberg rust project. If you're
>>> interested in rust, feel free to join us!
>>>
>>> On Tue, Jul 23, 2024, at 21:03, Fokko Driesprong wrote:
>>>
>>> Hi everyone,
>>>
>>> The Iceberg PMC is excited to announce new committers and PMC members to
>>> the Apache Iceberg project.
>>>
>>> New committers:
>>>
>>>
>>>-
>>>
>>>Kevin Liu (kevinjqliu)
>>>-
>>>
>>>Piotr Findeisen (findepi)
>>>-
>>>
>>>Sung Yun (syun64)
>>>-
>>>
>>>Xuanwo (xuanwo)
>>>
>>>
>>> New members of the PMC:
>>>
>>>
>>>-
>>>
>>>Honah (honahx)
>>>-
>>>
>>>Renjie Liu (liurenjie1024)
>>>
>>>
>>> We’re very excited to see the project grow in many ways by supporting
>>> new languages and setting new standards.
>>>
>>> Please join us in welcoming the new committers and PMC members!
>>>
>>> On behalf of the Iceberg PMC,
>>>
>>> Fokko
>>>
>>> Xuanwo
>>>
>>> https://xuanwo.io/
>>>
>>>


Re: [ANNOUNCE] Welcoming new committers and PMC members

2024-07-23 Thread Russell Spitzer
This is truly an exciting day. To have to many qualified folks being
recognized by the Iceberg project fills me with pride. I can't wait to see
what we get done together!

On Tue, Jul 23, 2024 at 9:12 AM Sung Yun  wrote:

> Thank you very much!
>
> I am excited to see the project growing to new capacities as well, and to
> be an active part of that journey.
>
> I will continue to work hard together with the community to take
> (Py)Iceberg to its next steps.
>
> Sung
>
>
> On Jul 23, 2024, at 10:01 AM, Matt Topol  wrote:
>
> 
>
> Congrats!!
>
> On Tue, Jul 23, 2024, 9:52 AM Xuanwo  wrote:
>
>> Thank you so much for the recognition! I'm thrilled to join as an Iceberg
>> Committer.
>>
>> I'm currently working hard on the iceberg rust project. If you're
>> interested in rust, feel free to join us!
>>
>> On Tue, Jul 23, 2024, at 21:03, Fokko Driesprong wrote:
>>
>> Hi everyone,
>>
>> The Iceberg PMC is excited to announce new committers and PMC members to
>> the Apache Iceberg project.
>>
>> New committers:
>>
>>
>>-
>>
>>Kevin Liu (kevinjqliu)
>>-
>>
>>Piotr Findeisen (findepi)
>>-
>>
>>Sung Yun (syun64)
>>-
>>
>>Xuanwo (xuanwo)
>>
>>
>> New members of the PMC:
>>
>>
>>-
>>
>>Honah (honahx)
>>-
>>
>>Renjie Liu (liurenjie1024)
>>
>>
>> We’re very excited to see the project grow in many ways by supporting new
>> languages and setting new standards.
>>
>> Please join us in welcoming the new committers and PMC members!
>>
>> On behalf of the Iceberg PMC,
>>
>> Fokko
>>
>> Xuanwo
>>
>> https://xuanwo.io/
>>
>>


Re: [RESULT][VOTE] Merge table spec clarifications on time travel and equality deletes

2024-07-19 Thread Russell Spitzer
+1. Sorry, I meant to vote before; I just had nits on the wording.

On Fri, Jul 19, 2024 at 2:04 PM Micah Kornfield 
wrote:

> Hi Dmitri,
> Thank you for the comment, maybe we can continue the discussion on the PR
> (there are still some other open issues).  I don't think the current spec
> references the REST catalog, but I think the same issue occurs for table
> specification.
>
> Cheers,
> Micah
>
> On Fri, Jul 19, 2024 at 10:37 AM Dmitri Bourlatchkov
>  wrote:
>
>> Sorry for the late reply. The vote closed, so I'll just post my comments
>> without voting here.
>>
>> My reading of the spec change in PR #8982 [1] is that it is not
>> normative. More specifically, REST catalog implementations that do not
>> expose the full snapshot history in metadata JSON will not violate the spec.
>>
>> Therefore, I do not oppose this change, but I'd appreciate it if this
>> point were explicitly mentioned in the spec text.
>>
>> I propose adding a phrase like "when the REST catalog makes the snapshot
>> history available in the metadata JSON, time travel queries should be
>> executed like this [existing spec text]. If a catalog does not expose
>> the full snapshot history, time travel queries should provide clear
>> messages in case they cannot find the appropriate snapshot".
>>
>> Thanks,
>> Dmitri.
>>
>> [1] https://github.com/apache/iceberg/pull/8982
>>
>> On Fri, Jul 19, 2024 at 1:15 PM Micah Kornfield 
>> wrote:
>>
>>> The vote passes with:
>>>
>>> 5  "+1 Binding votes"
>>> 3 "+1 Non-binding votes."
>>> 0 "-1 votes"
>>>
>>>
>>> Actions to be taken:
>>> 1.  Update the language/location of the clarification on time travel in
>>> https://github.com/apache/iceberg/pull/8982 and then have a
>>> committer/PMC member merge.  I'll try to have this updated by Monday.
>>> 2.  Merge https://github.com/apache/iceberg/pull/8981 (it seems there
>>> is no further feedback on this).
>>>
>>> Thanks everyone for the feedback.
>>>
>>>
>>> -Micah
>>>
>>>
>>>
>>>
>>> On Fri, Jul 19, 2024 at 9:03 AM Jack Ye  wrote:
>>>
 +1 (binding)

 added minor comments to the time travel PR.

 Best,
 Jack Ye

 On Fri, Jul 19, 2024 at 8:22 AM Daniel Weeks  wrote:

> +1 (binding)
>
> Thanks, Micah.
>
> On Thu, Jul 18, 2024 at 8:29 PM Amogh Jahagirdar <2am...@gmail.com>
> wrote:
>
>> +1 (non-binding) on these spec clarifications
>>
>> Thanks,
>> Amogh Jahagirdar
>>
>> On Thu, Jul 18, 2024 at 5:08 PM Steven Wu 
>> wrote:
>>
>>> I am +1 for the spec clarifications.
>>>
>>> I have left some comments for the time travel PR. We can discuss the
>>> details in the PR itself before merging. In particular, I am wondering if
>>> the time travel clarification can be added to the existing `snapshots`
>>> the time travel clarification can be add to the existing `snapshots`
>>> section of the spec (instead of adding a new `implementation notes` 
>>> section)
>>>
>>> On Thu, Jul 18, 2024 at 3:54 PM Ryan Blue
>>>  wrote:
>>>
 +1

 Thanks, Micah!

 On Tue, Jul 16, 2024 at 7:04 AM Jean-Baptiste Onofré <
 j...@nanthrax.net> wrote:

> +1 (non binding)
>
> Thanks !
> Regards
> JB
>
> On Mon, Jul 15, 2024 at 10:35 PM Micah Kornfield <
> emkornfi...@gmail.com> wrote:
> >
> > I'd like to raise on modifying the table specification with
> clarifications on time travel and equality deletes [1][2].  The PRs 
> have
> links to prior mailing list discussions where there was apparent 
> consensus
> that these were the expectations for functionality.
> >
> > Possible votes:
> > [ ] +1 Merge the PRs
> > [ ] +0
> > [ ] -1 Do not merge the PRs because ...
> >
> > The vote will remain open for at least 72 hours.
> >
> > Thanks,
> > Micah
> >
> > [1] https://github.com/apache/iceberg/pull/8982
> > [2] https://github.com/apache/iceberg/pull/8981
>


 --
 Ryan Blue
 Databricks

>>>


Re: [Early Feedback] Variant and Subcolumnarization Support

2024-07-18 Thread Russell Spitzer
I'm aligned with point 1.

For point 2, I think we should choose quickly. I honestly do think this
would be fine as part of the Iceberg Spec directly, but understand it may be
better for the broader community if it were a sub-project. As a
sub-project, I would still prefer it being an Iceberg sub-project since we
are engine/file-format agnostic.

3. I support adding just Variant.

On Thu, Jul 18, 2024 at 12:54 AM Aihua Xu  wrote:

> Hello community,
>
> It’s great to sync up with some of you on Variant and Subcolumnarization
> support in Iceberg again. Apologies that I didn’t record the meeting, but
> here are some key items that we want to follow up on with the community.
>
> 1. Adopt Spark Variant encoding
> Those present were in favor of adopting the Spark variant encoding for
> Iceberg Variant, with extensions to support other Iceberg types. We would
> like to know if anyone objects to reusing an open-source
> encoding.
>
> 2. Movement of the Spark Variant Spec to another project
> To avoid introducing Apache Spark as a dependency for the engines and file
> formats, we discussed separating Spark Variant encoding spec and
> implementation from the Spark Project to a neutral location. We thought up
> several solutions but didn’t have consensus on any of them. We are looking
> for more feedback on this topic from the community either in terms of
> support for one of these options or another idea on how to support the spec.
>
> Options Proposed:
> * Leave the Spec in Spark (Difficult for versioning and other engines)
> * Copying the Spec into Iceberg Project Directly (Difficult for other
> Table Formats)
> * Creating a Sub-Project of Apache Iceberg and moving the spec and
> reference implementation there (Logistically complicated)
> * Creating a Sub-Project of Apache Spark and moving the spec and reference
> implementation there (Logistically complicated)
>
> 3. Add Variant type vs. Variant and JSON types
> Those who were present were in favor of adding only the Variant type to
> Iceberg. We are looking for anyone who has an objection to going forward
> with just the Variant Type and no Iceberg JSON Type. We were favoring
> adding Variant type only because:
> * Introducing a JSON type would require engines that only support VARIANT
> to do write-time validation of their input to a JSON column. An engine
> without a JSON type wouldn’t be able to support this.
> * Engines which don’t support Variant will work most of the time but can
> have fallback strings defined in the spec for reading unsupported types.
> Writing a JSON into a Variant will always work.
>
> 4. Support for the Subcolumnarization spec (shredding in Spark)
> We have no action items on this but would like to follow up on discussions
> on Subcolumnarization in the future.
> * We had general agreement that this should be included in Iceberg V3 or
> else adding variant may not be useful.
> * We are interested in also adopting the shredding spec from Spark and
> would like to move it to wherever we decide the Variant spec is
> going to live.
>
> Let us know if we missed anything and if you have any additional thoughts or
> suggestions.
>
> Thanks
> Aihua
>
>
> On 2024/07/15 18:32:22 Aihua Xu wrote:
> > Thanks for the discussion.
> >
> > I will move forward to work on spec PR.
> >
> > Regarding the implementation, we will have a module for Variant support in
> Iceberg so we will not have to bring in Spark libraries.
> >
> > I'm reposting the meeting invite in case it's not clear in my original
> email since I included in the end. Looks like we don't have major
> objections/diverges but let's sync up and have consensus.
> >
> > Meeting invite:
> >
> > Wednesday, July 17 · 9:00 – 10:00am
> > Time zone: America/Los_Angeles
> > Google Meet joining info
> > Video call link: https://meet.google.com/pbm-ovzn-aoq
> > Or dial: ‪(US) +1 650-449-9343‬ PIN: ‪170 576 525‬#
> > More phone numbers: https://tel.meet/pbm-ovzn-aoq?pin=4079632691790
> >
> > Thanks,
> > Aihua
> >
> > On 2024/07/12 20:55:01 Micah Kornfield wrote:
> > > I don't think this needs to hold up the PR but I think coming to a
> > > consensus on the exact set of types supported is worthwhile (and if the
> > > goal is to maintain the same set as specified by the Spark Variant
> type or
> > > if divergence is expected/allowed).  From a fragmentation perspective
> it
> > > would be a shame if they diverge, so maybe a next step is also
> suggesting
> > > support to the Spark community on the missing existing Iceberg types?
> > >
> > > Thanks,
> > > Micah
> >

Re: [VOTE] Release Apache Iceberg 1.6.0 RC0

2024-07-17 Thread Russell Spitzer
I'm in for RC1,

-1 Vote for RC0

On Wed, Jul 17, 2024 at 3:13 PM Jean-Baptiste Onofré 
wrote:

> Hi Amogh
>
> Thanks ! Imho, I would prefer to change/"fix" the
> TableMetadata.Builder constructor in 1.6.0. If we release like this,
> it will be painful to deprecate and probably a bit confusing.
> I think it's better to cancel RC0 and cut RC1 including visibility
> change on the constructor, in order to have a "clean" 1.6.0 release.
>
> If there are no objections, I will cancel RC0 to prepare a RC1.
>
> Regards
> JB
>
> On Wed, Jul 17, 2024 at 10:04 PM Amogh Jahagirdar <2am...@gmail.com>
> wrote:
> >
> > Hey JB,
> >
> > Yes, I'd still hold on to my -1 (non-binding) vote due to the public
> TableMetadata.Builder constructor which should be private. I have a PR
> https://github.com/apache/iceberg/pull/10714 for addressing it (this
> would need to be cherry picked as well on to the 1.6 branch). If folks are
> in agreement with that, I'd recommend another candidate. If not (which I
> can understand since maybe it's a bit overkill), we could just go through a
> deprecation cycle.
> >
> > Thanks,
> >
> > Amogh Jahagirdar
> >
> > On Wed, Jul 17, 2024 at 1:39 PM Jean-Baptiste Onofré 
> wrote:
> >>
> >> Hi Amogh
> >>
> >> Are you keeping your -1 vote ? I'm a bit lost between your two messages
> :)
> >>
> >> Thanks !
> >> Regards
> >> JB
> >>
> >> On Wed, Jul 17, 2024 at 7:03 PM Amogh Jahagirdar <2am...@gmail.com>
> wrote:
> >> >
> >> > Following up,
> >> >
> >> > I think I confused myself on the original issue
> https://github.com/apache/iceberg/issues/8756 when testing. That issue
> was specific to REST implementations which use `CatalogHandlers` like
> `RESTCatalogAdapter` used in our unit tests. The fix in #10369 does address
> that case for creation. When testing I was creating a v2 table and
> attempting to replace it with a v1 table which I think makes sense to fail
> because the downgrade would possibly be lossy, and then rolling back would
> not be safe. My original statement that "I think clients should not fail to
> build the change set with the format version change." is probably not
> correct for the downgrade case; it sounds best to fail on the client side
> since it's known to be unsafe.
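
For context, a minimal sketch of the downgrade-on-replace scenario using the
public catalog API ("db.t" is assumed to already exist as a format-version=2
table; whether the failure should surface client-side or server-side is
exactly what is being discussed above):

    import org.apache.iceberg.PartitionSpec;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.Transaction;
    import org.apache.iceberg.catalog.Catalog;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;

    // Sketch: replace an existing v2 table while requesting format-version=1.
    // Per the discussion above, this is expected to fail because a v2 -> v1
    // downgrade can be lossy and rolling back would not be safe.
    static void attemptDowngrade(Catalog catalog, Schema schema) {
      Transaction replace =
          catalog.newReplaceTableTransaction(
              TableIdentifier.of("db", "t"),
              schema,
              PartitionSpec.unpartitioned(),
              ImmutableMap.of("format-version", "1"),
              false /* orCreate */);
      replace.commitTransaction(); // expected to throw rather than downgrade
    }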
> >> >
> >> > So from a fix/issue perspective, I think we're covered. However, in
> terms of APIs there's still the case of the public constructor that I added
> in #10369. That should not be public.
> >> >
> >> > Thanks and sorry for the confusion there,
> >> >
> >> > Amogh Jahagirdar
> >> >
> >> >
> >> >
> >> >
> >> > On Wed, Jul 17, 2024 at 9:48 AM Amogh Jahagirdar <2am...@gmail.com>
> wrote:
> >> >>
> >> >> I'm -1 (non-binding).
> >> >>
> >> >> Aside from running through the standard checks, I was testing
> https://github.com/apache/iceberg/pull/10369/files via Spark against a
> REST catalog (a non-testing REST catalog) and the issue still exists
> although the stack trace just looks a bit different now. The fix currently
> handles it on the catalog handler's side which really masks the real issue
> of failing to build the changes for the replace on the client side (so imo
> it's not really a fix looking back on it). I'm still thinking through what
> a robust solution is; in the end for REST, the service needs to be able to
> handle it, but I think clients should not fail to build the change set with
> the format version change.
> >> >>
> >> >> To be clear, I don't think I'd block on a fix for this since I'm not
> sure how common a case a format version downgrade on replace is,
> and if there's interest in a 1.6.1, we can aim for a more thought-through
> solution for that release.
> >> >>
> >> >> However, the main concern I have is that when I was going through the fix,
> the table metadata builder constructor I added as part of this
> https://github.com/apache/iceberg/pull/10369/files#diff-c540a31e66b157a8f080433c82a29a070096d0e08c6578a0099153f1229bdb7aR913
> is marked public, which I think I'd prefer to change to private upfront
> rather than have to go through a deprecation cycle/revAPI changes.
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Amogh Jahagirdar
> >> >>
> >> >>
> >> >> On Wed, Jul 17, 2024 at 2:29 AM Honah J.  wrote:
> >> >>>
> >> >>> +1 (non-binding)
> >> >>>
> >> >>> verified signature and checksum
> >> >>> verified license doc
> >> >>> verified build and tests with JDK 17
> >> >>>
> >> >>> Best regards,
> >> >>> Honah
> >> >>>
> >> >>> On Tue, Jul 16, 2024 at 10:40 PM Ajantha Bhat <
> ajanthab...@gmail.com> wrote:
> >> >
> >> > Gentle reminder for the PMC members, we need at least two
> additional binding votes.
> >> 
> >> 
> >>  One additional vote. We have binding votes from Russell and Fokko
> already.
> >> 
> >>  On Wed, Jul 17, 2024 at 10:54 AM Jean-Baptiste Onofré <
> j...@nanthrax.net> wrote:
> >> >
> >> > Gentle reminder for the PMC members, we need at least two
> additional
> >> > binding votes.
> >> >
> >> > Thanks !
> >> > Regards
> >> >>

Re: [Early Feedback] Variant and Subcolumnarization Support

2024-07-12 Thread Russell Spitzer
Just talked with Aihua and he's working on the Spec PR now. We can get
feedback there from everyone.

On Fri, Jul 12, 2024 at 3:41 PM Ryan Blue 
wrote:

> Good idea, but I'm hoping that we can continue to get their feedback in
> parallel to getting the spec changes started. Piotr didn't seem to object
> to the encoding from what I read of his comments. Hopefully he (and others)
> chime in here.
>
> On Fri, Jul 12, 2024 at 1:32 PM Russell Spitzer 
> wrote:
>
>> I just want to make sure we get Piotr and Peter on board as
>> representatives of Flink and Trino engines. Also make sure we have anyone
>> else chime in who has experience with Ray if possible.
>>
>> Spec changes feel like the right next step.
>>
>> On Fri, Jul 12, 2024 at 3:14 PM Ryan Blue 
>> wrote:
>>
>>> Okay, what are the next steps here? This proposal has been out for quite
>>> a while and I don't see any major objections to using the Spark encoding.
>>> It's quite well designed and fits the need well. It can also be extended to
>>> support additional types that are missing if that's a priority.
>>>
>>> Should we move forward by starting a draft of the changes to the table
>>> spec? Then we can vote on committing those changes and get moving on an
>>> implementation (or possibly do the implementation in parallel).
>>>
>>> On Fri, Jul 12, 2024 at 1:08 PM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
>>>> That's fair, I'm sold on an Iceberg Module.
>>>>
>>>> On Fri, Jul 12, 2024 at 2:53 PM Ryan Blue 
>>>> wrote:
>>>>
>>>>> > Feels like eventually the encoding should land in parquet proper
>>>>> right?
>>>>>
>>>>> What about using it in ORC? I don't know where it should end up. Maybe
>>>>> Iceberg should make a standalone module from it?
>>>>>
>>>>> On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> Feels like eventually the encoding should land in parquet proper
>>>>>> right? I'm fine with us just copying into Iceberg though for the time
>>>>>> being.
>>>>>>
>>>>>> On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue 
>>>>>> wrote:
>>>>>>
>>>>>>> Oops, it looks like I missed where Aihua brought this up in his last
>>>>>>> email:
>>>>>>>
>>>>>>> > do we have an issue to directly use Spark implementation in
>>>>>>> Iceberg?
>>>>>>>
>>>>>>> Yes, I think that we do have an issue using the Spark library. What
>>>>>>> do you think about a Java implementation in Iceberg?
>>>>>>>
>>>>>>> Ryan
>>>>>>>
>>>>>>> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I raised the same point from Peter's email in a comment on the doc
>>>>>>>> as well. There is a spark-variant_2.13 artifact that would be a much
>>>>>>>> smaller scope than relying on large portions of Spark, but even then
>>>>>>>> I
>>>>>>>> doubt that it is a good idea for Iceberg to depend on that because it 
>>>>>>>> is a
>>>>>>>> Scala artifact and we would need to bring in a ton of Scala libs. I 
>>>>>>>> think
>>>>>>>> what makes the most sense is to have an independent implementation of 
>>>>>>>> the
>>>>>>>> spec in Iceberg.
>>>>>>>>
>>>>>>>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <
>>>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Aihua,
>>>>>>>>> Long time no see :)
>>>>>>>>> Would this mean, that every engine which plans to support Variant
>>>>>>>>> data type needs to add Spark as a dependency? Like Flink/Trino/Hive 
>>>>>>>>> etc?
>>>>>>>>> Thanks, Peter
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jul 12, 2024, 19:10 Aihu

Re: [VOTE] Release Apache Iceberg 1.6.0 RC0

2024-07-12 Thread Russell Spitzer
+1 - Checked all the normal things (Rat, Tests, Build, Spark)

On Fri, Jul 12, 2024 at 1:14 PM Dmitri Bourlatchkov
 wrote:

> +1 (nb)
>
> I verified OAuth2 in the REST Catalog with Spark / Keycloak (client
> secret) / Nessie.
>
> The token URI warning is prominently displayed, when `oauth2-server-uri`
> is not configured.
>
> When the token URI is configured, the client secret flow works fine with
> Keycloak.
>
> Cheers,
> Dmitri.
>
> On Fri, Jul 12, 2024 at 10:49 AM Jean-Baptiste Onofré 
> wrote:
>
>> Hi everyone,
>>
>> I propose that we release the following RC as the official Apache
>> Iceberg 1.6.0 release.
>>
>> The commit ID is ed228f79cd3e569e04af8a8ab411811803bf3a29
>> * This corresponds to the tag: apache-iceberg-1.6.0-rc0
>> * https://github.com/apache/iceberg/commits/apache-iceberg-1.6.0-rc0
>> *
>> https://github.com/apache/iceberg/tree/ed228f79cd3e569e04af8a8ab411811803bf3a29
>>
>> The release tarball, signature, and checksums are here:
>> * https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.6.0-rc0
>>
>> You can find the KEYS file here:
>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>
>> Convenience binary artifacts are staged on Nexus. The Maven repository
>> URL is:
>> *
>> https://repository.apache.org/content/repositories/orgapacheiceberg-1164/
>>
>> Please download, verify, and test.
>>
>> Please vote in the next 72 hours.
>>
>> [ ] +1 Release this as Apache Iceberg 1.6.0
>> [ ] +0
>> [ ] -1 Do not release this because...
>>
>> Only PMC members have binding votes, but other community members are
>> encouraged to cast non-binding votes. This vote will pass if there are
>> 3 binding +1 votes and more binding +1 votes than -1 votes.
>>
>> Thanks,
>> Regards
>> JB
>>
>


Re: [Early Feedback] Variant and Subcolumnarization Support

2024-07-12 Thread Russell Spitzer
I just want to make sure we get Piotr and Peter on board as representatives
of Flink and Trino engines. Also make sure we have anyone else chime in who
has experience with Ray if possible.

Spec changes feel like the right next step.

On Fri, Jul 12, 2024 at 3:14 PM Ryan Blue 
wrote:

> Okay, what are the next steps here? This proposal has been out for quite a
> while and I don't see any major objections to using the Spark encoding.
> It's quite well designed and fits the need well. It can also be extended to
> support additional types that are missing if that's a priority.
>
> Should we move forward by starting a draft of the changes to the table
> spec? Then we can vote on committing those changes and get moving on an
> implementation (or possibly do the implementation in parallel).
>
> On Fri, Jul 12, 2024 at 1:08 PM Russell Spitzer 
> wrote:
>
>> That's fair, I'm sold on an Iceberg Module.
>>
>> On Fri, Jul 12, 2024 at 2:53 PM Ryan Blue 
>> wrote:
>>
>>> > Feels like eventually the encoding should land in parquet proper right?
>>>
>>> What about using it in ORC? I don't know where it should end up. Maybe
>>> Iceberg should make a standalone module from it?
>>>
>>> On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
>>>> Feels like eventually the encoding should land in parquet proper right?
>>>> I'm fine with us just copying into Iceberg though for the time being.
>>>>
>>>> On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue 
>>>> wrote:
>>>>
>>>>> Oops, it looks like I missed where Aihua brought this up in his last
>>>>> email:
>>>>>
>>>>> > do we have an issue to directly use Spark implementation in Iceberg?
>>>>>
>>>>> Yes, I think that we do have an issue using the Spark library. What do
>>>>> you think about a Java implementation in Iceberg?
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue 
>>>>> wrote:
>>>>>
>>>>>> I raised the same point from Peter's email in a comment on the doc as
>>>>>> well. There is a spark-variant_2.13 artifact that would be a much smaller
>>>>>> scope than relying on large portions of Spark, but even then I doubt
>>>>>> that
>>>>>> it is a good idea for Iceberg to depend on that because it is a Scala
>>>>>> artifact and we would need to bring in a ton of Scala libs. I think what
>>>>>> makes the most sense is to have an independent implementation of the spec
>>>>>> in Iceberg.
>>>>>>
>>>>>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <
>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Aihua,
>>>>>>> Long time no see :)
>>>>>>> Would this mean, that every engine which plans to support Variant
>>>>>>> data type needs to add Spark as a dependency? Like Flink/Trino/Hive etc?
>>>>>>> Thanks, Peter
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu  wrote:
>>>>>>>
>>>>>>>> Thanks Ryan.
>>>>>>>>
>>>>>>>> Yeah. That's another reason we want to pursue Spark encoding to
>>>>>>>> keep compatibility for the open source engines.
>>>>>>>>
>>>>>>>> One more question regarding the encoding implementation: do we have
>>>>>>>> an issue to directly use Spark implementation in Iceberg? Russell 
>>>>>>>> pointed
>>>>>>>> out that Trino doesn't have Spark dependency and that could be a 
>>>>>>>> problem?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Aihua
>>>>>>>>
>>>>>>>> On 2024/07/12 15:02:06 Ryan Blue wrote:
>>>>>>>> > Thanks, Aihua!
>>>>>>>> >
>>>>>>>> > I think that the encoding choice in the current doc is a good
>>>>>>>> one. I went
>>>>>>>> > through the Spark encoding in detail and it looks like a better
>>>>>>>> choice than
>>>

Re: [Early Feedback] Variant and Subcolumnarization Support

2024-07-12 Thread Russell Spitzer
That's fair, I'm sold on an Iceberg Module.

On Fri, Jul 12, 2024 at 2:53 PM Ryan Blue 
wrote:

> > Feels like eventually the encoding should land in parquet proper right?
>
> What about using it in ORC? I don't know where it should end up. Maybe
> Iceberg should make a standalone module from it?
>
> On Fri, Jul 12, 2024 at 12:38 PM Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> Feels like eventually the encoding should land in parquet proper right?
>> I'm fine with us just copying into Iceberg though for the time being.
>>
>> On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue 
>> wrote:
>>
>>> Oops, it looks like I missed where Aihua brought this up in his last
>>> email:
>>>
>>> > do we have an issue to directly use Spark implementation in Iceberg?
>>>
>>> Yes, I think that we do have an issue using the Spark library. What do
>>> you think about a Java implementation in Iceberg?
>>>
>>> Ryan
>>>
>>> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue  wrote:
>>>
>>>> I raised the same point from Peter's email in a comment on the doc as
>>>> well. There is a spark-variant_2.13 artifact that would be a much smaller
>>>> scope than relying on large portions of Spark, but even then I doubt that
>>>> it is a good idea for Iceberg to depend on that because it is a Scala
>>>> artifact and we would need to bring in a ton of Scala libs. I think what
>>>> makes the most sense is to have an independent implementation of the spec
>>>> in Iceberg.
>>>>
>>>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry <
>>>> peter.vary.apa...@gmail.com> wrote:
>>>>
>>>>> Hi Aihua,
>>>>> Long time no see :)
>>>>> Would this mean, that every engine which plans to support Variant data
>>>>> type needs to add Spark as a dependency? Like Flink/Trino/Hive etc?
>>>>> Thanks, Peter
>>>>>
>>>>>
>>>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu  wrote:
>>>>>
>>>>>> Thanks Ryan.
>>>>>>
>>>>>> Yeah. That's another reason we want to pursue Spark encoding to keep
>>>>>> compatibility for the open source engines.
>>>>>>
>>>>>> One more question regarding the encoding implementation: do we have
>>>>>> an issue to directly use Spark implementation in Iceberg? Russell pointed
>>>>>> out that Trino doesn't have Spark dependency and that could be a problem?
>>>>>>
>>>>>> Thanks,
>>>>>> Aihua
>>>>>>
>>>>>> On 2024/07/12 15:02:06 Ryan Blue wrote:
>>>>>> > Thanks, Aihua!
>>>>>> >
>>>>>> > I think that the encoding choice in the current doc is a good one.
>>>>>> I went
>>>>>> > through the Spark encoding in detail and it looks like a better
>>>>>> choice than
>>>>>> > the other candidate encodings for quickly accessing nested fields.
>>>>>> >
>>>>>> > Another reason to use the Spark type is that this is what Delta's
>>>>>> variant
>>>>>> > type is based on, so Parquet files in tables written by Delta could
>>>>>> be
>>>>>> > converted or used in Iceberg tables without needing to rewrite
>>>>>> variant
>>>>>> > data. (Also, note that I work at Databricks and have an interest in
>>>>>> > increasing format compatibility.)
>>>>>> >
>>>>>> > Ryan
>>>>>> >
>>>>>> > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu >>>>> .invalid>
>>>>>> > wrote:
>>>>>> >
>>>>>> > > [Discuss] Consensus for Variant Encoding
>>>>>> > >
>>>>>> > > It’s great to be able to present the Variant type proposal in the
>>>>>> > > community sync yesterday and I’m looking to host a meeting next
>>>>>> week
>>>>>> > > (targeting for 9am, July 17th) to go over any further concerns
>>>>>> about the
>>>>>> > > encoding of the Variant type and any other questions on the first
>>>>>> phase of
>>>>>> > > the proposal
>>>

Re: [Early Feedback] Variant and Subcolumnarization Support

2024-07-12 Thread Russell Spitzer
Feels like eventually the encoding should land in parquet proper right? I'm
fine with us just copying into Iceberg though for the time being.

On Fri, Jul 12, 2024 at 2:31 PM Ryan Blue 
wrote:

> Oops, it looks like I missed where Aihua brought this up in his last email:
>
> > do we have an issue to directly use Spark implementation in Iceberg?
>
> Yes, I think that we do have an issue using the Spark library. What do you
> think about a Java implementation in Iceberg?
>
> Ryan
>
> On Fri, Jul 12, 2024 at 12:28 PM Ryan Blue  wrote:
>
>> I raised the same point from Peter's email in a comment on the doc as
>> well. There is a spark-variant_2.13 artifact that would be a much smaller
>> scope than relying on large portions of Spark, but even then I doubt that
>> it is a good idea for Iceberg to depend on that because it is a Scala
>> artifact and we would need to bring in a ton of Scala libs. I think what
>> makes the most sense is to have an independent implementation of the spec
>> in Iceberg.
>>
>> On Fri, Jul 12, 2024 at 11:51 AM Péter Váry 
>> wrote:
>>
>>> Hi Aihua,
>>> Long time no see :)
>>> Would this mean, that every engine which plans to support Variant data
>>> type needs to add Spark as a dependency? Like Flink/Trino/Hive etc?
>>> Thanks, Peter
>>>
>>>
>>> On Fri, Jul 12, 2024, 19:10 Aihua Xu  wrote:
>>>
 Thanks Ryan.

 Yeah. That's another reason we want to pursue Spark encoding to keep
 compatibility for the open source engines.

 One more question regarding the encoding implementation: do we have an
 issue to directly use Spark implementation in Iceberg? Russell pointed out
 that Trino doesn't have Spark dependency and that could be a problem?

 Thanks,
 Aihua

 On 2024/07/12 15:02:06 Ryan Blue wrote:
 > Thanks, Aihua!
 >
 > I think that the encoding choice in the current doc is a good one. I
 went
 > through the Spark encoding in detail and it looks like a better
 choice than
 > the other candidate encodings for quickly accessing nested fields.
 >
 > Another reason to use the Spark type is that this is what Delta's
 variant
 > type is based on, so Parquet files in tables written by Delta could be
 > converted or used in Iceberg tables without needing to rewrite variant
 > data. (Also, note that I work at Databricks and have an interest in
 > increasing format compatibility.)
 >
 > Ryan
 >
 > On Thu, Jul 11, 2024 at 11:21 AM Aihua Xu >>> .invalid>
 > wrote:
 >
 > > [Discuss] Consensus for Variant Encoding
 > >
 > > It’s great to be able to present the Variant type proposal in the
 > > community sync yesterday and I’m looking to host a meeting next week
 > > (targeting for 9am, July 17th) to go over any further concerns
 about the
 > > encoding of the Variant type and any other questions on the first
 phase of
 > > the proposal
 > > <
 https://docs.google.com/document/d/1QjhpG_SVNPZh3anFcpicMQx90ebwjL7rmzFYfUP89Iw/edit
 >.
 > > We are hoping that anyone who is interested in the proposal can
 either join
 > > or reply with their comments so we can discuss them. Summary of the
 > > discussion and notes will be sent to the mailing list for further
 comment
 > > there.
 > >
 > >
 > >-
 > >
 > >What should be the underlying binary representation
 > >
 > > We have evaluated a few encodings in the doc including ION, JSONB,
 and
 > > Spark encoding.Choosing the underlying encoding is an important
 first step
 > > here and we believe we have general support for Spark’s Variant
 encoding.
 > > We would like to hear if anyone else has strong opinions in this
 space.
 > >
 > >
 > >-
 > >
 > >Should we support multiple logical types or just Variant?
 Variant vs.
 > >Variant + JSON.
 > >
 > > This is to discuss what logical data type(s) to be supported in
 Iceberg -
 > > Variant only vs. Variant + JSON. Both types would share the same
 underlying
 > > encoding but would imply different limitations on engines working
 with
 > > those types.
 > >
 > > From the sync up meeting, we are more favoring toward supporting
 Variant
 > > only and we want to have a consensus on the supported type(s).
 > >
 > >
 > >-
 > >
 > >How should we move forward with Subcolumnization?
 > >
 > > Subcolumnization is an optimization for Variant type by separating
 out
 > > subcolumns with their own metadata. This is not critical for
 choosing the
 > > initial encoding of the Variant type so we were hoping to gain
 consensus on
 > > leaving that for a follow up spec.
 > >
 > >
 > > Thanks
 > >
 > > Aihua
 > >
 > > Meeting invite:
 > >
 > > Wednesday, July 17 · 9:00 – 10:00am
 > > Time 

Re: [DISCUSS] Enable the discussion tab for iceberg github repos

2024-07-10 Thread Russell Spitzer
I'm a fan of having more things on github if possible. I haven't used this
feature but it sounds like it could be useful.

On Wed, Jul 10, 2024 at 6:15 AM Renjie Liu  wrote:

> I’m fine with enabling it in iceberg-rust first and see how it goes.
>
> On Wed, Jul 10, 2024 at 17:39 Fokko Driesprong  wrote:
>
>> Thanks for raising this. I would also prefer discussions over a user
>> mailing-list since it has a lower barrier. We could also first enable this
>> on Iceberg-rust and evaluate it after a while to see the added value and
>> then decide for Python and Java? WDYT?
>>
>> Kind regards,
>> Fokko
>>
>> On Wed, Jul 10, 2024 at 10:14, Jean-Baptiste Onofré > > wrote:
>>
>>> Yes and no :)
>>>
>>> It's a beta feature. My point is that we can enable GitHub Discussions
>>> easily, and depending on the timing, .asf.yaml support will be added.
>>>
>>> Regards
>>> JB
>>>
>>> On Tue, Jul 9, 2024 at 7:49 AM Xuanwo  wrote:
>>> >
>>> > Hi,
>>> >
>>> > > Regarding the discussion tab, it sounds good to me. It's pretty
>>> > straightforward to do by editing .asf.yaml.
>>> >
>>> > I tried this before. But .asf.yaml doesn't support controlling
>>> Discussions yet.
>>> > We need help from the Infra team.
>>> >
>>> >
>>> https://cwiki.apache.org/confluence/pages/viewpage.action?spaceKey=INFRA&title=git+-+.asf.yaml+features#Git.asf.yamlfeatures-GitHubDiscussions
>>> >
>>> >
>>> > On Tue, Jul 9, 2024, at 13:44, Jean-Baptiste Onofré wrote:
>>> > > Hi
>>> > >
>>> > > It's also possible to create a user mailing list if it helps.
>>> > >
>>> > > Regarding the discussion tab, it sounds good to me. It's pretty
>>> > > straightforward to do by editing .asf.yaml.
>>> > >
>>> > > Regards
>>> > > JB
>>> > >
>>> > > On Tue, Jul 9, 2024 at 5:18 AM Renjie Liu 
>>> wrote:
>>> > >>
>>> > >> Hi:
>>> > >>
>>> > >> Recently we have observed more and more users interested in
>>> iceberg-rust, and they have many questions about it, for example its
>>> status and its relationship with other projects such as pyiceberg. Slack is
>>> a great place for discussion, but it is not friendly for long discussions
>>> and not easy to comment in. We can also encourage users to use GitHub
>>> issues, but those are easy to mix up with true issues, e.g. feature
>>> tracking, bug tracking, etc.
>>> > >>
>>> > >> So I propose to enable the discussion tab for the repos of Iceberg and
>>> subprojects such as iceberg-rust, pyiceberg, and iceberg-go.
>>> >
>>> > --
>>> > Xuanwo
>>> >
>>> > https://xuanwo.io/
>>>
>>


Re: [Vote] Deprecate oauth tokens endpoint

2024-07-10 Thread Russell Spitzer
+`

On Wed, Jul 10, 2024 at 9:33 AM Renjie Liu  wrote:

> +1 (non binding)
>
> On Wed, Jul 10, 2024 at 4:35 PM roryqi  wrote:
>
>> +1.
>>
>>> Driesprong, Fokko  wrote on Wed, Jul 10, 2024 at 16:29:
>>
>>> +1 (binding)
>>>
>>> On Wed, Jul 10, 2024 at 10:14, Jean-Baptiste Onofré >> > wrote:
>>>
 +1 (non binding)

 NB: a few comments in the PR should be "addressed", but it's OK.

 Regards
 JB

 On Mon, Jul 8, 2024 at 6:15 PM Robert Stupp  wrote:
 >
 > Hi Everyone,
 >
 > I propose that we merge PR to "Deprecate oauth/tokens endpoint".
 >
 > The background and overall plan is discussed on this mailing list [2]
 > and this google doc [3].
 >
 > Please vote in the next 72 hours.
 >
 > Robert
 >
 >
 >
 > [1] https://github.com/apache/iceberg/pull/10603
 >
 > [2] https://lists.apache.org/thread/twk84xx7v0xy5q5tfd9x5torgr82vv50
 and
 > https://lists.apache.org/thread/wcm9ylm0nbwfrx65n8b1tpjrdhgvcx24 and
 > https://lists.apache.org/thread/qksh9j9d8h6nt6qrfl47bj76jthddb0p
 >
 > [3]
 >
 https://docs.google.com/document/d/1Xi5MRk8WdBWFC3N_eSmVcrLhk3yu5nJ9x_wC0ec6kVQ
 >
 > --
 > Robert Stupp
 > @snazy
 >

>>>


Re: [Vote] Deprecate oauth tokens endpoint

2024-07-10 Thread Russell Spitzer
+1

On Wed, Jul 10, 2024 at 11:03 AM Russell Spitzer 
wrote:

> +`
>
> On Wed, Jul 10, 2024 at 9:33 AM Renjie Liu 
> wrote:
>
>> +1 (non binding)
>>
>> On Wed, Jul 10, 2024 at 4:35 PM roryqi  wrote:
>>
>>> +1.
>>>
>>> Driesprong, Fokko  wrote on Wed, Jul 10, 2024 at 16:29:
>>>
>>>> +1 (binding)
>>>>
>>>> On Wed, Jul 10, 2024 at 10:14, Jean-Baptiste Onofré <
>>>> j...@nanthrax.net> wrote:
>>>>
>>>>> +1 (non binding)
>>>>>
>>>>> NB: a few comments in the PR should be "addressed", but it's OK.
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On Mon, Jul 8, 2024 at 6:15 PM Robert Stupp  wrote:
>>>>> >
>>>>> > Hi Everyone,
>>>>> >
>>>>> > I propose that we merge PR to "Deprecate oauth/tokens endpoint".
>>>>> >
>>>>> > The background and overall plan is discussed on this mailing list [2]
>>>>> > and this google doc [3].
>>>>> >
>>>>> > Please vote in the next 72 hours.
>>>>> >
>>>>> > Robert
>>>>> >
>>>>> >
>>>>> >
>>>>> > [1] https://github.com/apache/iceberg/pull/10603
>>>>> >
>>>>> > [2] https://lists.apache.org/thread/twk84xx7v0xy5q5tfd9x5torgr82vv50
>>>>> and
>>>>> > https://lists.apache.org/thread/wcm9ylm0nbwfrx65n8b1tpjrdhgvcx24 and
>>>>> > https://lists.apache.org/thread/qksh9j9d8h6nt6qrfl47bj76jthddb0p
>>>>> >
>>>>> > [3]
>>>>> >
>>>>> https://docs.google.com/document/d/1Xi5MRk8WdBWFC3N_eSmVcrLhk3yu5nJ9x_wC0ec6kVQ
>>>>> >
>>>>> > --
>>>>> > Robert Stupp
>>>>> > @snazy
>>>>> >
>>>>>
>>>>


Re: [VOTE] Fix property names in REST spec for statistics / partition statistics

2024-07-10 Thread Russell Spitzer
+1

On Wed, Jul 10, 2024 at 9:47 AM Amogh Jahagirdar <2am...@gmail.com> wrote:

> +1 (non-binding)
>
> On Wed, Jul 10, 2024 at 7:16 AM Piotr Findeisen 
> wrote:
>
>> +1 (non binding)
>>
>> On Wed, 10 Jul 2024 at 10:11, Jean-Baptiste Onofré 
>> wrote:
>>
>>> +1 (non binding)
>>>
>>> Regards
>>> JB
>>>
>>> On Wed, Jul 10, 2024 at 5:35 AM Eduard Tudenhöfner
>>>  wrote:
>>> >
>>> > Hey everyone,
>>> >
>>> > I propose to fix the property names in the REST spec for statistics /
>>> partition statistics so that they are properly aligned with the table spec
>>> and the implementation.
>>> >
>>> > Please vote within the next 72 hours.
>>> >
>>> > Eduard
>>>
>>


Re: [DISCUSS] Formalized File IO Properties

2024-07-10 Thread Russell Spitzer
Sounds reasonable to me

On Wed, Jul 10, 2024 at 9:28 AM Renjie Liu  wrote:

> Hi:
>
> +1 for standardizing Iceberg properties. This will help align the different
> language implementations.
>
> On Wed, Jul 10, 2024 at 9:44 PM  wrote:
>
>> Hello Everyone,
>>
>> I was considering discussing the standardization of Iceberg properties,
>> and I believe this thread could be a great place to start.
>>
>> I'm writing an Iceberg client in Elixir and using the Java, Python, and
>> Rust implementations as references. However, I've had some difficulty
>> determining which configurations we must support and what each client has
>> implemented. Therefore, I agree with Xuanwo about having a separate
>> section as a single source of truth (SSOT).
>>
>> Additionally, I think it would be beneficial for each client to show what
>> it does not support. This would make it easier for users to know that a
>> particular client might not work with some configuration that their catalog
>> could define as default or override. It would also help us, as
>> contributors, to know which configurations we need to implement support for.
>>
>> For example, the "s3.signer"[1] and "s3.proxy-uri"[2] configurations only
>> exist in the Python implementation. I believe it is not clear that these
>> configurations are exclusive to Python, and they might be configurations
>> that the catalog could override or define as defaults in the get info
>> endpoint. Without an SSOT, this could be harder to track.
>>
>> Another example is the "rest.authorization-url" in Python and Rust versus
>> "oauth2_server_uri" in Java. Although this is a bit out of scope for this
>> thread, I will open another discussion topic about broader standardization
>> of available properties.
>>
>> [1]:
>> https://github.com/search?q=repo%3Aapache%2Ficeberg-python+s3.signer&type=code
>> [2]:
>> https://github.com/search?q=repo%3Aapache%2Ficeberg-python%20S3_PROXY_URI&type=code
>> On Wednesday, July 10th, 2024 at 7:51 AM, Fokko Driesprong <
>> fo...@apache.org> wrote:
>>
>> Hey Xuanwo,
>>
>> Thanks for raising this.
>>
>>- The S3 properties are largely covered under the S3FileIO page:
>>https://iceberg.apache.org/docs/nightly/aws/#s3-fileio. But it looks
>>like some important ones are missing indeed. I've raised an issue here.
>>- PyIceberg only supports a subset of the functionality, and therefore
>>many properties are missing there as well.
>>- For the REST Catalog, there is an open PR to add the options for GCS
>>and ADLS. It would be great to get some more eyes on it.
>>
>> That being said, I do think there is value in formalizing them. When
>> adding configuration options to PyIceberg, I'll make sure to check out the
>> Java implementation to ensure that we use the same property.
>>
>> Kind regards,
>> Fokko
>>
>> On Wed, Jul 10, 2024 at 09:22, Xuanwo  wrote:
>>
>>> Hello everyone
>>>
>>> I've been working on the iceberg-rust FileIO recently and have found it
>>> challenging to identify all the necessary IO properties we need to support.
>>>
>>> For instance, consider AWS S3. There are no documents specifying which
>>> properties are supported by S3.
>>>
>>> The only relevant documentation I could find includes:
>>>
>>> - Iceberg AWS Integrations[1]: Does not define `s3.access-key-id` or
>>> `s3.secret-access-key`.
>>> - Pyiceberg configuration[2]: Missing several S3-related properties.
>>> - Iceberg REST Catalog[3]: Does not cover all storage services.
>>>
>>> To gather this information, we must refer to the S3FileIO Java code[4].
>>>
>>> I propose adding a separate section for agreeing upon these properties.
>>> We could create a specification that outlines all IO properties with
>>> indications of whether they are required or optional, along with their
>>> expected behaviors. This would help ensure consistency across different
>>> implementations without any conflicts.
>>>
>>>
>>> [1]: https://iceberg.apache.org/docs/latest/aws/
>>> [2]: https://py.iceberg.apache.org/configuration/#s3
>>> [3]:
>>> https://github.com/apache/iceberg/blob/eee81c59199a54e749ea58dae070eb066d9a5f9e/open-api/rest-catalog-open-api.yaml#L2737
>>> [4]:
>>> https://github.com/apache/iceberg/blob/2b21020aedb63c26295005d150c05f0a5a5f0eb2/aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIOProperties.java#L46
>>>
>>> Xuanwo
>>>
>>> https://xuanwo.io/
>>>
>>
>>
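
To make the property-surface question concrete: a minimal sketch of how the
Java implementation consumes such keys today, via FileIO.initialize. The
endpoint and credential values are placeholders; the exact set of supported
keys across implementations is the gap this thread is about.

    import java.util.Map;
    import org.apache.iceberg.aws.s3.S3FileIO;
    import org.apache.iceberg.io.FileIO;
    import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;

    // Sketch: every FileIO implementation ultimately receives a flat string
    // map like this one, which is why a cross-language specification of the
    // keys (required vs. optional, expected behavior) would help.
    static FileIO newS3FileIO() {
      Map<String, String> props =
          ImmutableMap.of(
              "s3.endpoint", "https://s3.us-east-1.amazonaws.com", // placeholder
              "s3.access-key-id", "<access-key>",                  // placeholder
              "s3.secret-access-key", "<secret-key>");             // placeholder
      FileIO io = new S3FileIO();
      io.initialize(props);
      return io;
    }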


Re: Building with JDK 21

2024-07-09 Thread Russell Spitzer
The different formatting preferences sound annoying enough that I would
think we should just drop Java 8 support. Do we have anyone who strongly
prefers keeping Java 8 support?

As an alternative, I think it would be fine if we disable the formatter when
using Java 21 and just make sure we always run tests with Java 8 and
the formatter checks in our CI. If we go this route, I think we stay with
Java 8 for formatting and save the reformat for when Java 8 is dropped
officially.

On Tue, Jul 9, 2024 at 7:32 AM Piotr Findeisen 
wrote:

> Hi,
>
> Java 21 is the latest "LTS version", released GA in September 2023.
> Some Iceberg users already run with Java 21 in production (and FWIW Trino
> runs with 22 already).
> I thought it would be nice to add support for building and testing Iceberg
> with Java 21.
>
> Conceptually this is simple (see PR
> ), but there is a caveat
> worth discussing:
> There seems to be no version of Google Java Format library that can run
> under JDK 8 and JDK 21.
> Choosing Google Java Format version dynamically is not an option, because
> different versions have slightly different formatting preferences, so
> updating formatter version requires updating the code in a handful of
> places.
>
> Question:
> do we want to add support for building and testing with Java 21?
> Ability to test with Java 21 would match what some of Iceberg users are
> doing.
> If we choose so, we would simply disable the Spotless formatter when the
> build runs on Java 21 (or 8, if that is preferred instead)
>
> or we prefer to wait until we can drop Java 8 support
>  first, and only then add
> Java 21 support?
>
> Pre-existing context:
> the topic has been discussed on the PR here:
> https://github.com/apache/iceberg/pull/10474#discussion_r1658513019
> and it was proposed there to bring this to Dev group attention.
>
> Best,
> PF
>
>


Re: Feedback Collection: Bylaws in Iceberg

2024-06-25 Thread Russell Spitzer
Thanks for bringing this up Jack.

I think having more established rules specifically for the project is
probably a good thing to make sure outsiders see a path to becoming more
included in the project. I'm especially interested in the proposals for
more actively including newer contributors from different backgrounds in
the management of the project. I have definitely heard from others that the
project is difficult to approach, and it's not especially clear how one
goes about becoming involved other than "write some code and hope it
gets reviewed."

I'm definitely more on the cautious side with this sort of thing. I'd
probably prefer we start with a direct copy of the Apache rules we
currently follow and then go one by one through changes. I agree with
others on the list that any changes should have a clear reason why new
rules should be put in place and some measure we can use to judge the
success of that rule. I am also especially worried about the concept of
"active" members of the project; from a logistics perspective it just sounds
like a hassle, and I'm not sure we gain much by tracking it.

I'm still on vacation at the moment, so I haven't had too much time to
actually dig into the details here, but I'll be on next week for real and
can read through everything and add any feedback I have.

On Tue, Jun 25, 2024 at 12:55 PM Tyler Akidau
 wrote:

> Thank you Jack for wrangling this. These conversations are never easy, but
> I'm glad to see this happening. As a relative newcomer to the Iceberg
> effort, I am personally supportive of the idea of bylaws. I agree with Ryan
> that it's important to not overspecify and overindex on processes, but I do
> think a project that has grown to the size of Iceberg may need to start to
> function differently from the way it did when it was much smaller if it
> wants to continue moving forward quickly, effectively, and equitably.
>
> A few of my opinions on some of Jack's points below, take them for what
> they are worth:
> 1. I like the idea of guidelines on committership and PMC membership, but
> worry about overspecification limiting who might be considered. Just
> something to be careful about.
> 2. On paper active/inactive/emeritus sounds nice, but if emeritus markers
> are just advisory from an ASF rules perspective, then what's the point?
> Just making it clear to an outsider what the active community looks like at
> any given time?
> 3. PMC chair rotation: it seems a bit silly to have to consider this when
> the chair role is largely meant to be an administrative function, but given
> that the chair in every project I've seen has indeed carried outsized
> influence as a result, it does seem like a healthy exercise.
> 5. I personally prefer majority voting rules for an effort the size of
> Iceberg. The single veto power in consensus votes just makes it too easy
> for one person to filibuster changes that most of the community would like
> to see move forward. Beyond a certain point, consensus just isn't
> practical, IMO. It also motivates folks who care deeply about their -1 to
> rally support for their position, sharing the burden of influence across
> both sides of the discussion. With consensus votes, the person who is -1
> doesn't have to convince anyone, they can just block, and all of the burden
> of advocacy falls on the +1s.
>
> -Tyler
>
>
>
>
> On Tue, Jun 25, 2024 at 10:08 AM Jack Ye  wrote:
>
>> Thanks everyone for the insightful comments! I have raised a separate
>> thread for the initial trimmed down version of the proposal.
>>
>> To summarize, here are the discussion points we will have once the
>> initial version is passed:
>> 1. guidelines for committership and PMC membership
>> 2. active, inactive, emeritus status for committers and PMC members
>> 3. PMC chair rotation
>> 4. definition of subprojects for releases
>> 5. design proposal acceptance vote criteria
>> 6. security related issues reporting and fixes process
>> 7. commit-and-review for young subprojects
>>
>> I will start with topic 1, and discuss these topics one by one.
>>
>> Feel free to keep providing comments or feedback in this thread, and
>> please let me know any additional topics I did not cover in the list above!
>>
>> Best,
>> Jack Ye
>>
>>
>>
>>
>> On Tue, Jun 25, 2024 at 8:54 AM Honah J.  wrote:
>>
>>> Hi everyone,
>>>
>>> Thanks to everyone for the valuable points raised recently.
>>>
>>>
>>> I’m in favor of having bylaws that contain details on how the Iceberg
>>> community works, especially regarding “Decisions,” as this will provide
>>> clarity and guidance for all members. I would like to share some of my
>>> thoughts on this:
>>>
>>>
>>>- +1 on focusing on landing the initial version of the bylaws.
>>>Starting with what we are doing now is good. Having a foundational set of
>>>rules in place is essential for moving forward effectively.
>>>- As others mentioned, the approval process for Design Proposals
>>>merits further discussion. We might consider 

Re: Summary of Iceberg Materialized View Meeting

2024-06-06 Thread russell . spitzer
Thanks for hosting, it was a very helpful meeting. I really hope we can do
more in the future to accelerate consensus on other proposals. I do
encourage anyone on the mailing list to add your comments offline as well,
especially if you have strong feelings. Iceberg is an open project, and we
realize not everyone can attend virtual meetings; we want you to know you
are welcome.

> On Jun 6, 2024, at 7:11 AM, Jan Kaul  wrote:
> Hi all,
>
> Thanks to all of you who attended the meeting yesterday! It was great to
> talk to you and I think we made great progress. For those of you who
> weren't able to attend the meeting, I have summarized the main points
> below:
>
> Question 1: Should we store the "storage table pointer" as a view property
> or as an additional field in the view metadata?
>
> We reached consensus to add a new metadata field "storage-table" to the
> view version record that stores the identifier of the storage table. The
> motivation for introducing a new field is that it emphasizes that
> materialized views are part of the standard and it enforces a common
> behavior.
>
> Question 2: Where should the lineage-state information be stored?
>
> We reached consensus on storing the lineage-state information in the
> snapshot summary of the storage table. The motivation behind this is that
> the table spec should not be concerned with defining view constructs.
>
> Question 3: How should the lineage-state information be represented?
>
> We reached consensus on representing the lineage-state in the form of
> nested objects and storing these as a JSON-encoded string inside the
> storage table snapshot summary.
>
> Additionally, Dan proposed to introduce a new lineage construct as part of
> the view definition, in addition to the lineage-state that is part of the
> storage table. The idea is to separate the concerns: the lineage-state in
> the storage table should only capture the state of the source tables at
> the time of the last refresh, whereas the lineage information in the view
> contains more information about the source tables and is responsible for
> resolving the identifiers. We haven't yet decided how the new lineage
> construct should be represented or integrated into the view metadata.
>
> One point that we didn't really have time to discuss was Benny's comment
> about also storing the version-id of views in the case that the
> materialized view references a view. I think we should also integrate that
> into the spec.
>
> You can find the recording of the meeting here:
> https://drive.google.com/file/d/1DE09tYS28L3xL_NgnM9g0Olbe6aHza5G/view?usp=sharing
>
> Best wishes,
>
> Jan
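
As a purely hypothetical illustration of the Question 3 consensus (the
summary key and the JSON shape below are made up for this example; the
actual names were not finalized in this thread):

    import java.util.Map;
    import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;

    // Hypothetical only: on refresh, the storage table's snapshot summary
    // could carry the lineage-state as a JSON-encoded string, e.g.
    static final Map<String, String> EXAMPLE_SNAPSHOT_SUMMARY =
        ImmutableMap.of(
            "lineage-state", // made-up key for illustration
            "{\"source-tables\":[{\"identifier\":\"db.orders\",\"snapshot-id\":123}]}");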

  



Re: [Proposal] Add support for Flink Maintenance in Iceberg

2024-05-06 Thread Russell Spitzer
+1
I'm mostly in favor of the single pipeline model but I don't see any issue with 
supporting both models. 

> On May 6, 2024, at 1:43 PM, Rodrigo Meneses  wrote:
> 
> +1
> Thanks so much for driving this Peter!
> 
> On Fri, May 3, 2024 at 11:30 AM Péter Váry  > wrote:
>> Hi everyone,
>> 
>> I would like to make a proposal [1] to support Flink Table Maintenance in 
>> Iceberg. The main goal is to have a solution where Flink can execute the 
>> Maintenance Tasks as part of the streaming job. Especially Rewrite Data 
>> Files, Rewrite Manifest Files and Expire Snapshots.
>> The secondary goal is to provide building blocks for Flink batch jobs to 
>> execute the Maintenance Tasks independently, where the scheduling is done 
>> outside of Flink.
>> 
>> This proposal is the outcome of extensive community discussions on the 
>> mailing list [2, 3].
>> 
>> Please respond with your recommendation:
>> +1 if you support moving forward with the two separate objects model.
>> 0 if you are neutral.
>> -1 if you disagree with the two separate objects model.
>> 
>> Thanks,
>> Peter
>> 
>> [1] https://github.com/apache/iceberg/issues/10264
>> [2] https://lists.apache.org/thread/yjcwbf1037jdq4prty6rtrrqmjzc71o0
>> [3] https://lists.apache.org/thread/10mdf9zo6pn0dfq791nf4w1m7jh9k3sl



Re: [VOTE] Release Apache Iceberg 1.5.2 RC0

2024-05-06 Thread Russell Spitzer
+1 (binding)

Checked all the normal things
Tests
Rat
Checksum
Internal Tests

> On May 6, 2024, at 5:35 AM, Cheng Pan  wrote:
> 
> +1 (non-binding)
> 
> Integrated with Apache Kyuubi[1], the CI covers
> 
> - Iceberg Spark 3.3-3.5 with Scala 2.12
> - Iceberg Spark 3.5 with Scala 2.13
> 
> [1] https://github.com/apache/kyuubi/pull/6361
> 
> Thanks,
> Cheng Pan
> 
> On 2024/05/01 17:25:08 Amogh Jahagirdar wrote:
>> Hi Everyone,
>> 
>> I propose that we release the following RC as the official Apache Iceberg
>> 1.5.2 release.
>> 
>> The commit ID is cbb853073e681b4075d7c8707610dceecbee3a82
>> * This corresponds to the tag: apache-iceberg-1.5.2-rc0
>> * https://github.com/apache/iceberg/commits/apache-iceberg-1.5.2-rc0
>> *
>> https://github.com/apache/iceberg/tree/cbb853073e681b4075d7c8707610dceecbee3a82
>> 
>> The release tarball, signature, and checksums are here:
>> * https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.5.2-rc0
>> 
>> You can find the KEYS file here:
>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>> 
>> Convenience binary artifacts are staged on Nexus. The Maven repository URL
>> is:
>> * https://repository.apache.org/content/repositories/orgapacheiceberg-1163/
>> 
>> Please download, verify, and test.
>> 
>> Please vote in the next 72 hours.
>> 
>> [ ] +1 Release this as Apache Iceberg 1.5.2
>> [ ] +0
>> [ ] -1 Do not release this because...
>> 
>> Only PMC members have binding votes, but other community members are
>> encouraged to cast
>> non-binding votes. This vote will pass if there are 3 binding +1 votes and
>> more binding
>> +1 votes than -1 votes.
>> 
> 
> 



Re: [Proposal] Add support for Materialized Views in Iceberg

2024-04-25 Thread Russell Spitzer
+1 to separate.

> On Apr 25, 2024, at 2:08 PM, Jean-Baptiste Onofré  wrote:
> 
> +1 to separate, it makes sense to me.
> 
> Regards
> JB
> 
> On Thu, Apr 18, 2024 at 11:50 AM Walaa Eldin Moustafa
>  wrote:
>> 
>> Hi everyone,
>> 
>> I would like to make a proposal for issue [1] to support materialized views 
>> in Iceberg. The support leverages two separate objects, an Iceberg view and 
>> an Iceberg table to implement materialized views. Each object retains 
>> relevant metadata to support the MV operations. An initial design, which we 
>> can refine, is detailed in the description section of this PR [2].
>> 
>> This proposal is the outcome of extensive community discussions in various 
>> forums [3, 4, 5, 6, 7].
>> 
>> Please respond with your recommendation:
>> +1 if you support moving forward with the two separate objects model.
>> 0 if you are neutral.
>> -1 if you disagree with the two separate objects model.
>> 
>> Thanks,
>> Walaa.
>> 
>> [1] https://github.com/apache/iceberg/issues/10043
>> [2] https://github.com/apache/iceberg/pull/9830
>> [3] 
>> https://docs.google.com/document/d/1zg0wQ5bVKTckf7-K_cdwF4mlRi6sixLcyEh6jErpGYY
>> [4] https://github.com/apache/iceberg/issues/6420
>> [5] 
>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF
>> [6] https://lists.apache.org/thread/tb3wcs7czjvjbq9y1qtr87g9s95ky5zh
>> [7] https://lists.apache.org/thread/l6cvrp4r1001k08cy2ypybzy2kgxpt1y



Re: [VOTE] Release Apache Iceberg 1.5.0 RC6

2024-03-11 Thread Russell Spitzer
+1 (binding)

Verified all the usual things
Ran full test suite
Everything looking good

> On Mar 8, 2024, at 11:35 PM, Szehon Ho  wrote:
> 
> +1 (binding)
> 
> * Verified signature
> * Verified checksum
> * RAT check
> * built JDK 11
> * Ran basic tests on Spark 3.5
> 
> Thanks
> Szehon
> 
> On Fri, Mar 8, 2024 at 5:50 PM Amogh Jahagirdar  > wrote:
>> +1 non-binding
>> 
>> Verified signatures,checksums,RAT checks, build, and tests with JDK11. I 
>> also ran ad-hoc tests for views in Trino with the rest catalog.
>> 
>> Thanks,
>> 
>> Amogh Jahagirdar
>> 
>> On Fri, Mar 8, 2024 at 5:04 PM Ryan Blue > > wrote:
>>> +1 (binding)
>>> 
>>> - Normal tarball verification
>>> - Read from my broken view successfully
>>> 
>>> On Fri, Mar 8, 2024 at 3:07 PM Daniel Weeks >> > wrote:
 +1 (binding)
 
 Verified sigs/sums/license/build/tests (Java 17)
 
 -Dan
 
 On Thu, Mar 7, 2024 at 2:10 PM Hussein Awala >>> > wrote:
> +1 (non-binding)
> - checked checksum and signature
> - built from source with jdk11
> - tested read and write with Spark 3.5.1 and Glue catalog
> 
> All looks good
> 
> On Thu, Mar 7, 2024 at 10:49 PM Drew  > wrote:
>> +1 (non-binding)
>> 
>> - verified signature and checksum
>> - verified RAT license check
>> - verified build/tests passing with JDK17
>> - ran some manual tests on Spark3.5 with GlueCatalog
>> 
>> Drew
>> 
>> On Thu, Mar 7, 2024 at 4:38 AM Ajantha Bhat > > wrote:
>>> +1 (non-binding)
>>> 
>>> * validated checksum and signature
>>> * checked license docs & ran RAT checks
>>> * ran build and tests with JDK11
>>> * verified view support for Nessie catalog with Spark 3.5.
>>> * verified this RC against Trino 
>>> (https://github.com/trinodb/trino/pull/20957)
>>> 
>>> - Ajantha
>>> 
>>> 
>>> On Wed, Mar 6, 2024 at 7:25 PM Jean-Baptiste Onofré >> > wrote:
 +1 (non binding)
 
 - checksums and signatures are OK
 - ASF headers are present
 - No unexpected binary files in the source distribution
 - Build OK with JDK11
 - JdbcCatalog tested on Trino and Iceland
 - No unexpected artifact distributed
 
 Thanks !
 
 Regards
 JB
 
 On Wed, Mar 6, 2024 at 12:04 AM Ajantha Bhat >>> > wrote:
 >
 > Hi Everyone,
 >
 > I propose that we release the following RC as the official Apache 
 > Iceberg 1.5.0 release.
 >
 > The commit ID is 2519ab43d654927802cc02e19c917ce90e8e0265
 > * This corresponds to the tag: apache-iceberg-1.5.0-rc6
 > * https://github.com/apache/iceberg/commits/apache-iceberg-1.5.0-rc6
 > * 
 > https://github.com/apache/iceberg/tree/2519ab43d654927802cc02e19c917ce90e8e0265
 >
 > The release tarball, signature, and checksums are here:
 > * 
 > https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.5.0-rc6
 >
 > You can find the KEYS file here:
 > * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
 >
 > Convenience binary artifacts are staged on Nexus. The Maven 
 > repository URL is:
 > * 
 > https://repository.apache.org/content/repositories/orgapacheiceberg-1161/
 >
 > Please download, verify, and test.
 >
 > Please vote in the next 72 hours.
 >
 > [ ] +1 Release this as Apache Iceberg 1.5.0
 > [ ] +0
 > [ ] -1 Do not release this because...
 >
 > Only PMC members have binding votes, but other community members are 
 > encouraged to cast
 > non-binding votes. This vote will pass if there are 3 binding +1 
 > votes and more binding
 > +1 votes than -1 votes.
 >
 > - Ajantha
>>> 
>>> 
>>> --
>>> Ryan Blue
>>> Tabular



Re: New committer: Bryan Keller

2024-03-05 Thread Russell Spitzer
Congratulations!

> On Mar 5, 2024, at 10:04 AM, Chris Ward  wrote:
> 
> Congrats Bryan!
>  
> From: Steve Zhang 
> Date: Tuesday, March 5, 2024 at 9:59 AM
> To: dev@iceberg.apache.org 
> Subject: Re: New committer: Bryan Keller
> 
> Congrats Bryan, well deserved!
>  
> Thanks,
> Steve Zhang
>  
>  
> 
> 
> On Mar 5, 2024, at 9:44 AM, Szehon Ho  wrote:
>  
> Congratulations Bryan, well deserved, great work on Iceberg !
>  
> On Tue, Mar 5, 2024 at 8:14 AM Jack Ye  > wrote:
> Congrats Bryan!
>  
> -Jack
>  
> On Tue, Mar 5, 2024 at 7:33 AM Amogh Jahagirdar  > wrote:
> Congratulations Bryan! Very well deserved, thank you for all your 
> contributions!
>  
> On Tue, Mar 5, 2024 at 7:29 AM Steven Wu  > wrote:
> Bryan, congratulations and thank you for your many contributions.
>  
> On Tue, Mar 5, 2024 at 5:54 AM Bryan Keller  > wrote:
> Thanks everyone! I really appreciate it, Iceberg has been inspiring to me, 
> both the project itself and the people involved, so I’m thankful to have been 
> given the opportunity to contribute!
>  
> On Tue, Mar 5, 2024 at 5:28 AM Mehul Batra  > wrote:
> Congratulations Bryan!
>  
> On Tue, Mar 5, 2024 at 1:50 PM Fokko Driesprong  > wrote:
> Hi everyone,
>  
> The Project Management Committee (PMC) for Apache Iceberg has invited Bryan 
> Keller to become a committer and we are pleased to announce that he has 
> accepted.
> 
> Bryan was contributing to Iceberg before it was even open-source, did a lot 
> of work on the topic of metadata generation, and is now leading the effort of 
> migrating the Kafka Connect integration into OSS Iceberg.
> 
> Being a committer enables easier contribution to the project since there is 
> no need to go via the patch submission process. This should enable better 
> productivity. A PMC member helps manage and guide the direction of the 
> project.
> 
> Please join me in congratulating Bryan.
> 
> Cheers,
> Fokko



Re: Partition column order in rewrite manifests

2024-01-30 Thread russell . spitzer
Sounds like a reasonable thing to add? Maybe we could check cardinality to
pick out the default order as well?

Sent from my iPhone

> On Jan 30, 2024, at 3:50 PM, Jack Ye  wrote:
>
> Hi everyone,
>
> Today, the rewrite manifests procedure always orders the data files based
> on their data_file.partition value. Specifically, it sorts data files that
> have the same partition value, and then does a repartition by range based
> on the target number of manifest files (ref).
>
> I notice that this approach does not always yield the best performance for
> scan planning, because the resulting manifest entry order is basically
> based on the default order of the partition columns. For example, consider
> a table partitioned by columns a and b. By default the rewrite procedure
> will organize manifest entries based on column a and then b. If most of my
> queries use b as the predicate, rewriting manifests by sorting first on
> column b and then a will yield a much shorter scan planning time, because
> all manifest entries with similar b values are close together, and the
> manifest list can be used to prune many files without even opening the
> manifest files.
>
> This happens a lot in cases where b is an event-time timestamp column,
> which is not the first partition column but is the column that is read
> most frequently in every query.
>
> Translated to code, this means we could benefit from something like:
>
> SparkActions.rewriteManifests(table)
>   .sort("b", "a")
>   .commit()
>
> Any thoughts?
>
> Best,
> Jack Ye
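
For reference, the existing action (without the proposed sort control) is
invoked roughly as sketched below; the .sort("b", "a") call in Jack's
snippet is the proposed addition, not a method that exists today.

    import org.apache.iceberg.Table;
    import org.apache.iceberg.actions.RewriteManifests;
    import org.apache.iceberg.spark.actions.SparkActions;

    // Sketch of today's API: manifests are rewritten and clustered by the
    // default partition-column order; there is currently no way to ask for
    // entries to be organized by column b first.
    static RewriteManifests.Result rewrite(Table table) {
      return SparkActions.get()
          .rewriteManifests(table)
          .rewriteIf(manifest -> manifest.length() < 8 * 1024 * 1024) // optional filter
          .execute();
    }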


Re: [DISCUSS] Iceberg community summit

2024-01-12 Thread Russell Spitzer
I'd also like to volunteer

> On Jan 12, 2024, at 12:27 PM, Brian Olsen  wrote:
> 
> Hey Iceberg nation,
> 
> I would like to volunteer to be on the selection committee. I have a lot of 
> experience from my time working on the Trino Community. I helped run the 
> Trino Summit’s in 2021() and 2022 (
> https://trino.io/blog/2022/11/21/trino-summit-2022-recap). The selection 
> committee was Martin Traverso, Manfred Moser, and myself. We always believed 
> that the primary goal of a successful summit was enablement and bringing new 
> faces into the project, while driving net new awareness with the audiences 
> that sponsors bring. 
> 
> I’ve written on Iceberg (
> https://trino.io/blog/2021/05/03/a-gentle-introduction-to-iceberg) and more 
> recently have been focused on refactoring documentation (
> https://github.com/apache/iceberg/pull/8919) while simultaneously taking 
> stock of areas in the docs that need to be filled. I have also started work 
> on a blog series that revamps the messaging for Iceberg 101 and doing a fair 
> amount of research of where we need more discussion to lower the barrier for 
> Iceberg adoption. With the work on docs and research I’ve looked at, I would 
> love the opportunity to help build a speaker lineup that discusses 
> fundamental Iceberg architecture concepts and how those relate to real 
> problems that were solved. I would also aim to balance this with a healthy 
> focus on state-of-the-art improvements and roadmap discussions.
> 
> I hope you’ll consider me for the selection committee this year and either 
> way, I’m happy to help in any other way I can. Thanks JB and Ryan for your 
> continued work here.
> 
> On Fri, Jan 12, 2024 at 12:00 PM Alex Merced  
> wrote:
>> I'm glad to volunteer in anyway I can be helpful
>> 
>> On Fri, Jan 12, 2024 at 12:54 PM Jack Ye > > wrote:
>>> Thanks for continuing the effort! I definitely would like to volunteer if 
>>> possible!
>>> 
>>> Best,
>>> Jack Ye
>>> 
>>> On Fri, Jan 12, 2024 at 9:49 AM Ryan Blue >> > wrote:
 Hi everyone,
 
 We've been having discussions about how to put together an Iceberg 
 conference or summit for this year and one of the first steps is to put 
 together a selection committee that will be responsible for choosing talks 
 and guiding the process. Once we have a selection committee, we can put 
 together the concrete proposal for the ASF and the Iceberg PMC to request 
 the ability to use the name Iceberg.
 
 If you'd like to help and be part of the selection committee, please 
 volunteer in a reply to this thread.
 
 Since we likely can't include everyone that volunteers, I propose that the 
 PMC should choose the final committee from the set of people that 
 volunteer. We'll leave this open for the next week or so to give people 
 time.
 
 Ryan
 
 
 --
 Ryan Blue
>> 
>> 
>> --
>> Alex Merced
>> Developer Advocate
>> alex.mer...@dremio.com 


Re: [PROPOSAL] Improvement on our PR flows

2024-01-03 Thread Russell Spitzer
I definitely need something to keep emailing me, so I support this.

On Wed, Jan 3, 2024 at 7:52 AM Jean-Baptiste Onofré  wrote:

> Hi guys,
>
> We have several examples where  we have some kind of "stale" PRs,
> either because we are waiting for a review, or we are waiting for
> changes from the contributor.
>
> We are already using two jobs around issues/PRs:
> - labeler to label PRs depending of the Iceberg modules change scope
> - stale to stale/close issues (we don't touch PRs in stale job today)
>
> In order to "improve" the PRs flow, I would like to propose the following:
>
> 1. We keep our labeler as it is. I propose to add
> .github/reviewers.yml to automatically add reviewers depending on the
> labels. It would look like (this is just an example, I will do a more
> concrete setup in a PR if there are no objection):
>
> labels:
>   - name: API
> reviewers:
>   - rdblue
>   - aokolnychyi
>   - Fokko
> exclusionList: []
>   - name: CORE
> reviewers:
>   - rdblue
>   - Fokko
>   - nastra
> exclusionList: []
>   - name: FLINK
> reviewers:
>   - nastra
> exclusionList: []
>...
>   fallbackReviewers:
> - rdblue
> - Fokko
> - nastra
> - jbonofre
>
> 2. We can update the stale job to add a reminder message to
> reviewer/contributor on PR. For instance, something like:
>
> name: Mark and close stale issues and pull requests
>
> on:
>   schedule:
>   - cron: '0 0 * * *'
>   workflow_dispatch:
>
> permissions: read-all
> jobs:
>   stale:
>     runs-on: ubuntu-latest
>     permissions:
>       issues: write
>       pull-requests: write
>     steps:
>     - uses: actions/stale@v9
>       with:
>         stale-issue-label: 'stale'
>         exempt-issue-labels: 'not-stale'
>         days-before-issue-stale: 180
>         days-before-issue-close: 14
>         stale-issue-message: >
>           This issue has been automatically marked as stale because
>           it has been open for 180 days
>           with no activity. It will be closed in the next 14 days if
>           no further activity occurs. To
>           permanently prevent this issue from being considered
>           stale, add the label 'not-stale',
>           but commenting on the issue is preferred when possible.
>         close-issue-message: >
>           This issue has been closed because it has not received any
>           activity in the last 14 days
>           since being marked as 'stale'
>         stale-pr-message: 'This pull request has been marked as
>           stale due to 15 days of inactivity. It will be closed in 1 week if no
>           further activity occurs. If you think that’s incorrect or this pull
>           request requires a review, please simply write any comment. If closed,
>           you can revive the PR at any time and @mention a reviewer or discuss
>           it on the dev@iceberg.apache.org list. Thank you for your
>           contributions.'
>         close-pr-message: 'This pull request has been closed due to
>           lack of activity. If you think that is incorrect, or the pull request
>           requires review, you can revive the PR at any time.'
>         stale-pr-label: 'stale'
>         days-before-pr-stale: 15
>         days-before-pr-close: 7
>         exempt-pr-labels: "pinned,security"
>         operations-per-run: 100
>
> Thoughts ?
>
> PS: I did set up this on Apache Beam for example, and we did speed up
> the review and PR flows.
>
> Regards
> JB
>


Re: [DISCUSS] Run GC with Catalog or Tables

2023-12-06 Thread Russell Spitzer
I just think this is a bit more complicated than I want to take into the main 
library just because we have to make decisions about

1. Retries
2. Concurrency
3. Results/Error Reporting

But if we have a good proposal for how we will handle all those, I think we
could do it?

> On Dec 6, 2023, at 2:05 PM, Andrea Campolonghi  wrote:
> 
> I think that if you call an expire snapshots function this is exactly what 
> you want 
> 
> On Wed, Dec 6, 2023 at 18:47 Ryan Blue  > wrote:
>> My concern with the per-catalog approach is that people might accidentally 
>> run it. Do you think it's clear enough that these invocations will drop 
>> older snapshots?
>> 
>> On Wed, Dec 6, 2023 at 2:40 AM Andrea Campolonghi > > wrote:
>>> I like this approach. + 1
>>> 
 On 6 Dec 2023, at 11:37, naveen >>> > wrote:
 
 Hi Everyone,
 
 Currently Spark-Procedures supports expire_snapshots/remove_orphan_files 
 per table.
 
 Today, if someone has to run GCs on an entire catalog they will have to 
 manually run these procedures for every table.
 
 Is it a good idea to do it in bulk as per catalog or with multiple tables ?
 
 Current syntax:
 CALL hive_prod.system.expire_snapshots(table => 'db.sample', )
 Proposed Syntax something similar:
 
 Per Namespace/Database
 CALL hive_prod.system.expire_snapshots(database => 'db', )
 Per Catalog
 CALL hive_prod.system.expire_snapshots()
 Multiple Tables
 CALL hive_prod.system.expire_snapshots(tables => Array('db1.table1', 
 'db2.table2'), )
 PS: There could be exceptions for individual catalogs. Like Nessie doesn't 
 support GC other than Nessie CLI. Hadoop can't list all the Namespaces.
 
 
 Regards,
 Naveen Kumar
 
>>> 
>> 
>> 
>> --
>> Ryan Blue
>> Tabular
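
Until something like this lands, a caller can approximate catalog-level GC by
looping over tables client-side. A minimal sketch, assuming an initialized
Catalog `catalog` and a SparkSession `spark` are in scope; the retry,
concurrency, and error-reporting concerns raised above are deliberately left
out:

    import java.util.concurrent.TimeUnit;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.Namespace;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.spark.actions.SparkActions;

    // Sketch: expire snapshots older than 7 days for every table in one
    // namespace. Retries, parallelism, and error reporting are omitted.
    long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
    for (TableIdentifier id : catalog.listTables(Namespace.of("db"))) {
      Table table = catalog.loadTable(id);
      SparkActions.get(spark)
          .expireSnapshots(table)
          .expireOlderThan(cutoff)
          .execute();
    }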



Re: Iceberg Logo Fix and Iceberg Swag Shop

2023-12-06 Thread Russell Spitzer
Ah got it! For some reason I kept looking for a circle, but in the link you 
sent I can see the obvious polygon that is missing. 

I'm +1 on switching the image to the one offered by Tabular

> On Dec 6, 2023, at 10:01 AM, Brian Olsen  wrote:
> 
> Thanks Weston and Russell,
> 
> To see the old file, look at the Wikimedia Commons image: 
> https://en.wikipedia.org/wiki/Apache_Iceberg#/media/File:Apache_Iceberg_Logo.svg.
>  You'll notice the transparent background reveals a triangular hole. 
> 
> You can also see this in the Apache store on RedBubble when looking on 
> backgrounds that are not white: https://www.redbubble.com/shop/ap/40954182
> 
> If you look at the image we use in the Tabular newsletter, our designers 
> closed up that hole: 
> https://tabular.io/images/blog/iceberg-announcements/october-2023.webp
> 
> Those are the only public images I know of. Let me know if there are any 
> issues viewing them.
> 
> On Wed, Dec 6, 2023 at 9:53 AM Weston Pace  <mailto:weston.p...@gmail.com>> wrote:
>> BTW: ASF mailing lists strip attachments and so you will need to use a gist 
>> or some other sharing.
>> 
>> On Wed, Dec 6, 2023, 7:22 AM Russell Spitzer > <mailto:russell.spit...@gmail.com>> wrote:
>>> The original email has a broken png link so I was never able to see the 
>>> issue, could you attach the before and after so I can see the difference?
>>> 
>>>> On Dec 6, 2023, at 9:07 AM, Brian Olsen >>> <mailto:bitsondata...@gmail.com>> wrote:
>>>> 
>>>> Hey all,
>>>> 
>>>> I wanted to resurface this and see if any PMC could take a look. Thanks!
>>>> 
>>>> On Wed, Nov 1, 2023 at 8:37 AM Jean-Baptiste Onofré >>> <mailto:j...@nanthrax.net>> wrote:
>>>>> Hi Brian,
>>>>> 
>>>>> Good catch.
>>>>> 
>>>>> We need to get approval from the PMC, and notify ASF VP Brand Management 
>>>>> (Mark Thomas) by sending a message to tradema...@apache.org 
>>>>> <mailto:tradema...@apache.org>. 
>>>>> We can also be in touch with ASF comdev and marketing teams to help to 
>>>>> update the logo and so.
>>>>> 
>>>>> I can help with this, don't hesitate to ping me !
>>>>> 
>>>>> Regards
>>>>> JB
>>>>> 
>>>>> On Wed, Nov 1, 2023 at 7:56 AM Brian Olsen >>>> <mailto:bitsondata...@gmail.com>> wrote:
>>>>>> Hey Iceberg Nation,
>>>>>> 
>>>>>> I wanted to address an issue with the Iceberg Logo used by the ASF. 
>>>>>> Somewhere along the way, a hole was added to the Iceberg logo (global 
>>>>>> warming? 😬). I first noticed it when uploading the logo to the Wikipedia 
>>>>>> Commons 
>>>>>> <https://commons.wikimedia.org/wiki/File:Apache_Iceberg_Logo.svg>, but 
>>>>>> thought it was perhaps intentional at the time.
>>>>>> 
>>>>>> This came up again when I was looking for options to buy an Iceberg 
>>>>>> shirt on RedBubble from the ASF Official store 
>>>>>> <https://www.redbubble.com/shop/ap/40954182>. However, when looking at 
>>>>>> the shirts I remembered the holey Iceberg seeing the logo on a non-white 
>>>>>> shirt.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I followed up with Ryan, and he said this hole wasn't originally there 
>>>>>> and isn't supposed to be there. I want to add the RedBubble shop to the 
>>>>>> Iceberg site. I believe having a way for all of us to show our ❤️ for 
>>>>>> Iceberg is one of the best ways to build not only awareness but a common 
>>>>>> identity around the project.
>>>>>> 
>>>>>> The Tabular design team has created a fixed SVG file, I just wanted to 
>>>>>> better understand the steps necessary to get this approved by the PMC, 
>>>>>> and where we should submit this to update the ASF logo and get them to 
>>>>>> ultimately update the redbubble site. Following that, I will add a PR to 
>>>>>> add the Redbubble site with the fixed logo to our site.
>>>>>> 
>>>>>> Thanks all,
>>>>>> Bits
>>>>>> 
>>>>>>   
>>> 



Re: Iceberg Logo Fix and Iceberg Swag Shop

2023-12-06 Thread Russell Spitzer
The original email has a broken png link so I was never able to see the issue, 
could you attach the before and after so I can see the difference?

> On Dec 6, 2023, at 9:07 AM, Brian Olsen  wrote:
> 
> Hey all,
> 
> I wanted to resurface this and see if any PMC could take a look. Thanks!
> 
> On Wed, Nov 1, 2023 at 8:37 AM Jean-Baptiste Onofré  > wrote:
>> Hi Brian,
>> 
>> Good catch.
>> 
>> We need to get approval from the PMC, and notify ASF VP Brand Management 
>> (Mark Thomas) by sending a message to tradema...@apache.org 
>> . 
>> We can also be in touch with ASF comdev and marketing teams to help to 
>> update the logo and so.
>> 
>> I can help with this, don't hesitate to ping me !
>> 
>> Regards
>> JB
>> 
>> On Wed, Nov 1, 2023 at 7:56 AM Brian Olsen > > wrote:
>>> Hey Iceberg Nation,
>>> 
>>> I wanted to address an issue with the Iceberg Logo used by the ASF. 
>>> Somewhere along the way, a hole was added to the Iceberg logo (global 
>>> warming? 😬). I first noticed it when uploading the logo to the Wikipedia 
>>> Commons , 
>>> but thought it was perhaps intentional at the time.
>>> 
>>> This came up again when I was looking for options to buy an Iceberg shirt 
>>> on RedBubble from the ASF Official store 
>>> . However, when looking at the 
>>> shirts I remembered the holey Iceberg seeing the logo on a non-white shirt.
>>> 
>>> 
>>> 
>>> I followed up with Ryan, and he said this hole wasn't originally there and 
>>> isn't supposed to be there. I want to add the RedBubble shop to the Iceberg 
>>> site. I believe having a way for all of us to show our ❤️ for Iceberg is 
>>> one of the best ways to build not only awareness but a common identity 
>>> around the project.
>>> 
>>> The Tabular design team has created a fixed SVG file, I just wanted to 
>>> better understand the steps necessary to get this approved by the PMC, and 
>>> where we should submit this to update the ASF logo and get them to 
>>> ultimately update the redbubble site. Following that, I will add a PR to 
>>> add the Redbubble site with the fixed logo to our site.
>>> 
>>> Thanks all,
>>> Bits
>>> 
>>>   



Re: Is there a way to distcp iceberg table from hadoop?

2023-12-04 Thread Russell Spitzer
Delta now exposes this functionality as a command, and some groups (like
ours) have some internal functionality for doing this. I think it's worth
reconsidering this as a first class procedure in the Iceberg-Spark module
since we get a lot of requests about it and now position deletes are a bit
more complicated.

On Sat, Dec 2, 2023 at 5:35 PM Wing Yew Poon 
wrote:

> Aren't we forgetting about position delete files? If the table has
> position delete files, then those contain absolute file paths as well.
> We cannot add them to the table as-is. We need to rewrite them. This, I
> think, is the most painful part of replicating an Iceberg table.
> - Wing Yew
>
>
> On Sat, Dec 2, 2023 at 5:23 PM Fokko Driesprong  wrote:
>
>> Hi Dongjun,
>>
>> Thanks for reaching out on the mailinglist. Another option might be to
>> copy the data, and then use a Spark procedure, called add_files
>>  to
>> add the files to the table. Let me know if this works for you.
>>
>> Kind regards,
>> Fokko
>>
>> Op za 2 dec 2023 om 02:43 schreef Ajantha Bhat :
>>
>>> Hi,
>>>
>>> You are right. Moving Iceberg tables from storage and expecting them to
>>> function at the new location is not currently feasible.
>>> The issue lies in the metadata files, which store the absolute path.
>>>
>>> To address this, we need support for relative paths, but it appears that
>>> progress on this front has been slow.
>>> You can monitor the status of this feature at
>>> https://github.com/apache/iceberg/pull/8260.
>>>
>>> As a temporary fix, you can use the CTAS method to create a duplicate
>>> copy of the table at the desired new path.
>>>
>>> Thanks,
>>> Ajantha
>>>
>>> On Fri, Dec 1, 2023 at 10:01 PM Dongjun Hwang 
>>> wrote:
>>>
 Hello! My name is Dongjun Hwang.

 I recently performed distcp on the iceberg table in Hadoop.

 Data search was not possible because all file paths in the metadata
 directory were not changed.

 Is there a way to distcp the iceberg table?

 thank you!!

>>>
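
For reference, a rough sketch of the copy-then-register approach Fokko
describes, using the add_files Spark procedure from a session with an Iceberg
catalog configured. The catalog, table, and path names here are placeholders,
and per Wing Yew's point, position delete files are not covered by this and
would still need to be rewritten:

    // Hypothetical names; run after copying the data files with distcp.
    spark.sql(
        "CALL my_catalog.system.add_files("
            + "table => 'db.tbl', "
            + "source_table => '`parquet`.`hdfs://new-cluster/warehouse/db/tbl/data`')");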


Re: Add me to slack channel

2023-11-07 Thread Russell Spitzer
There is a self-invite link here: https://iceberg.apache.org/community/#slack
If it is no longer working let us know; we have to renew it every few hundred
invitees.

On Sun, Nov 5, 2023 at 1:02 PM Sardar Khan
 wrote:

> Hi,
> I have a few questions regards to the new deltalakemigration method here:
> https://iceberg.apache.org/docs/1.3.0/delta-lake-migration/
>
> Could you please add me to the slack channel, so I could ask my questions,
>
> Best,
> Sardar Khan
>
>
>
> --
>
> The information contained in this e-mail may be confidential and/or
> proprietary to Capital One and/or its affiliates and may only be used
> solely in performance of work or services for Capital One. The information
> transmitted herewith is intended only for use by the individual or entity
> to which it is addressed. If the reader of this message is not the intended
> recipient, you are hereby notified that any review, retransmission,
> dissemination, distribution, copying or other use of, or taking of any
> action in reliance upon this information is strictly prohibited. If you
> have received this communication in error, please contact the sender and
> delete the material from your computer.
>
>
>
>
>


Re: [VOTE] Release Apache Iceberg 1.4.2 RC0

2023-11-02 Thread Russell Spitzer
+1 - Checked all the normal things (Checksum, Tests, Rat)

> On Nov 2, 2023, at 12:38 PM, Ryan Blue  wrote:
> 
> +1
> 
> Thanks for getting this fix out, Amogh!
> 
> On Thu, Nov 2, 2023 at 9:19 AM Amogh Jahagirdar  > wrote:
>> Thanks Ajantha, I've reached out to a few more PMCs to see about getting 
>> some more votes casted. A few folks are voting on it, really appreciate 
>> everyone's help on this release.
>> 
>> Thanks,
>> 
>> Amogh Jahagirdar
>> 
>> On Wed, Nov 1, 2023 at 9:47 PM Ajantha Bhat > > wrote:
>>> Friendly reminder: We are still in need of one more binding vote. Kindly 
>>> double-check and cast your vote, if possible.
>>> 
>>> On Tue, Oct 31, 2023 at 10:48 AM Ajantha Bhat >> > wrote:
 +1 (non-binding) 
 
 - validated checksum and signature
 - checked license docs & ran RAT checks
 - ran build and tests with JDK11
 
 Thanks, 
 Ajantha 
 
 On Tue, Oct 31, 2023 at 6:32 AM Amogh Jahagirdar >>> > wrote:
> +1 non-binding
> 
> 1.) Validated checksum and signature
> 2.) RAT/license checks
> 3.) Build/tests with JDK8
> 4.) Performed an ad-hoc Spark test which would read a table with 
> corrupted offsets and verified it succeeded to verify that the new patch 
> works end to end.
>  
> Thanks,
> 
> Amogh Jahagirdar
> 
> On Mon, Oct 30, 2023 at 11:20 AM Hussein Awala  > wrote:
>> +1 (non-binding) I tested the RC with spark 3.3 and hive catalog. All 
>> looks good.
>> 
>> 
>> On Mon 30 Oct 2023 at 16:48, Eduard Tudenhoefner > > wrote:
>>> +1 (non-binding)
>>> 
>>> * validated checksum and signature
>>> * checked license
>>> * ran build and tests with JDK11
>>> 
>>> On Sat, Oct 28, 2023 at 11:09 PM Amogh Jahagirdar >> > wrote:
 Hi Everyone,
 
 I propose that we release the following RC as the official Apache 
 Iceberg 1.4.2 release.
 
 The commit ID is f6bb9173b13424d77e7ad8439b5ef9627e530cb2
 * This corresponds to the tag: apache-iceberg-1.4.2-rc0
 * https://github.com/apache/iceberg/commits/apache-iceberg-1.4.2-rc0
 * 
 https://github.com/apache/iceberg/tree/f6bb9173b13424d77e7ad8439b5ef9627e530cb2
 
 The release tarball, signature, and checksums are here:
 * 
 https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.4.2-rc0
 
 You can find the KEYS file here:
 * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
 
 Convenience binary artifacts are staged on Nexus. The Maven repository 
 URL is:
 https://repository.apache.org/content/repositories/orgapacheiceberg-1148/
 
 This release includes a patch for ensuring engines can successfully 
 read tables when their split offset metadata was corrupted due to a 
 bug in 1.4.0. See https://github.com/apache/iceberg/pull/8925 for more 
 details.
 
 Please download, verify, and test.
 
 Please vote in the next 72 hours.
 [ ] +1 Release this as Apache Iceberg 1.4.2
 [ ] +0
 [ ] -1 Do not release this because...
 
 Only PMC members have binding votes, but other community members are 
 encouraged to cast
 non-binding votes. This vote will pass if there are 3 binding +1 votes 
 and more binding
 +1 votes than -1 votes.
> 
> 
> --
> Ryan Blue
> Tabular



Re: [Discussion] Move `iceberg-parquet` and `iceberg-orc` modules into `iceberg-core`

2023-11-02 Thread Russell Spitzer
Is there an alternative where we do an implementation similar to how Position 
Deletes and Data Files are currently written? Like we have the more generic 
"writers" in core but the actual implementations still live in iceberg-parquet 
or iceberg-orc?

> On Nov 2, 2023, at 9:38 AM, Ajantha Bhat  wrote:
> 
> Hi Renjie, 
> 
> I have highlighted the use case from the above mail,
>  
>> However, with the addition of partition statistics 
>> ,
>>  Iceberg's metadata (stats file) will be
>> represented in Parquet or ORC formats.
>> To enable the `iceberg-core` module to write metadata in Parquet or ORC 
>> format, it will make extensive use of the functions found in the 
>> `iceberg-parquet`
>> and `iceberg-orc` modules. However, due to a circular dependency issue, 
>> `iceberg-core` cannot directly rely on `iceberg-parquet` and `iceberg-orc`.
>> Consequently, I suggest merging `iceberg-parquet` and `iceberg-orc` as 
>> packages within the `iceberg-core` module.
>  
> A utility for reading and writing partition statistics in Parquet format is 
> expected to take the form outlined here 
> ,
>  leveraging the `iceberg-parquet` dependency.
> 
> To facilitate on-demand partition statistics computation, this utility can 
> find a home in either `iceberg-data` or a new module that relies on both 
> `iceberg-parquet` and `iceberg-orc`. This approach would enable all engines 
> to make use of it.
> 
> However, for the synchronous calculation of statistics during insertion, 
> similar to how Trino supports Puffin stats, the `iceberg-core` module's 
> snapshot producer must have access to this utility. This presents a challenge 
> due to the existing circular dependency, as `iceberg-parquet` and 
> `iceberg-orc` already depend on `iceberg-core`.
> 
> To resolve this circular dependency issue, my proposal is to integrate them 
> as separate packages within the `iceberg-core` module. 
> I believe it's best to include them in the appropriate place during the 
> initial addition itself to support both synchronous and asynchronous writes,
> instead of adding to `iceberg-data` just for asynchronous writes and later 
> deprecating and moving them to core during synchronous write implementation.  
> 
> Moving them to `iceberg-core` can also open up the possibility of writing 
> existing metadata (like manifests, manifests lists) in Parquet or ORC instead 
> of avro in future.
> 
> Thanks, 
> Ajantha 
> 
> On Thu, Nov 2, 2023 at 5:07 PM Renjie Liu  > wrote:
>> Hi:
>> 
>> Could you provide concrete cases to elaborate this change?
>> 
>> On Thu, Nov 2, 2023 at 4:22 PM Gabor Kaszab > > wrote:
>>> Hey Ajantha,
>>> 
>>> Wouldn't this require a major version bump considering this is a breaking 
>>> change for users depending on iceberg-parquet or iceberg-orc now?
>>> 
>>> Gabor
>>> 
>>> On Thu, Nov 2, 2023 at 3:01 AM Ajantha Bhat >> > wrote:
 Hi Everyone, 
 
 At present, Iceberg exclusively utilizes Avro, JSON, and Puffin formats to 
 handle metadata. Few discussions in the past have explored the possibility 
 of supporting these existing metadata in Parquet or ORC format. However, 
 with the addition of partition statistics 
 ,
  Iceberg's metadata (stats file) will be 
 represented in Parquet or ORC formats. 
 
 To enable the `iceberg-core` module to write metadata in Parquet or ORC 
 format, it will make extensive use of the functions found in the 
 `iceberg-parquet` 
 and `iceberg-orc` modules. However, due to a circular dependency issue, 
 `iceberg-core` cannot directly rely on `iceberg-parquet` and 
 `iceberg-orc`. 
 Consequently, I suggest merging `iceberg-parquet` and `iceberg-orc` as 
 packages within the `iceberg-core` module.
 
 For end users, the main change in the new release package will be the 
 absence of separate `iceberg-parquet` and `iceberg-orc` JAR files. 
 Instead, they can 
 depend on `iceberg-core` (which they were likely doing already). This 
 change will also be clearly documented in the release notes.
 
 I would appreciate hearing your thoughts on this proposal.
 
 For a detailed look at the code changes required to implement the 
 integration of `iceberg-parquet` into `iceberg-core`, 
 please refer to the following PR: 
 https://github.com/apache/iceberg/pull/8500
 
 Thanks, 
 Ajantha



Re: [PROPOSAL] Use Microsoft Style Guide for documentation

2023-11-02 Thread Russell Spitzer
+1

> On Nov 1, 2023, at 6:13 PM, Yufei Gu  wrote:
> 
> +1 Love the following example. Not sure if Vale can catch this and provide 
> suggestions. It may be only possible with LLM.
>> Replace this: If you're ready to purchase Office 365 for your organization, 
>> contact your Microsoft account representative.
>> With this: Ready to buy? Contact us.
> 
> Yufei
> 
> 
> On Wed, Nov 1, 2023 at 12:20 PM Ryan Blue  > wrote:
>> +1
>> 
>> On Wed, Nov 1, 2023 at 6:38 AM Jean-Baptiste Onofré > > wrote:
>>> Hi Brian
>>> 
>>> I like the proposal, it sounds like a good way to "align" our documentation.
>>> 
>>> Thanks !
>>> Regards
>>> JB
>>> 
>>> On Wed, Nov 1, 2023 at 8:20 AM Brian Olsen >> > wrote:
>>> >
>>> > Hey Iceberg Nation, As I've gone through the Iceberg docs, I've noticed a 
>>> > lot of inconsistencies with terminology, grammar, and style. As a 
>>> > distributed community, we have a lot of non-native English speakers 
>>> > reading and writing our documentation. I propose we adopt the Microsoft 
>>> > Style Guide to improve the communication and consistency of the docs. 
>>> > Common rules like defaulting to use present tense not only make the 
>>> > documentation consistent but also more accessible for those who struggle 
>>> > to understand complex conjugations. Then there are examples like making 
>>> > sure to capitalize proper nouns like (Spark, Flink, Trino, Apache 
>>> > Software Foundation, etc...). You may think, that's great Brian, but good 
>>> > luck getting everyone reading the project and following that. I also want 
>>> > to propose adding a prose linter called Vale, that will enable us to add 
>>> > the existing rules for the Microsoft Style Guide, and our own custom 
>>> > rules to ensure consistent style with documentation changes.
>>> > Let's discuss this in the sync tomorrow! Bits
>> 
>> 
>> --
>> Ryan Blue
>> Tabular



Re: Proposal: Introduce deletion vector file to reduce write amplification

2023-10-09 Thread Russell Spitzer
The main things I’m still interested in are alternative approaches. I think
that some of the work Anton is doing has shown some different bottlenecks in
applying delete files that I’m not sure are addressed by this proposal.

For example, this proposal suggests doing a 1 to 1 (or 1 rowgroup to 1)
delete file application in order to speed up planning. But this could also
be done with a puffin file indexing delete files to data files. This would
eliminate any planning cost while also allowing us to do more complicated
things like mapping multiple data files to a single delete file, as well as
operating on a one-to-many data file to delete file approach. Doing this
approach would mean we wouldn't need to change any existing metadata or
introduce a new separate file type.

I think basically for every “benefit” outlined we should think about whether
there is an alternative approach that would achieve the same benefit. Then we
should analyze whether or not the proposal is the best solution for that
particular benefit and do some work to calculate what that benefit would be
and what drawbacks there might be.

I would also expect some POC experiments showing that the spec is getting the
benefits that are hypothesized.

The proposal I think also needs to address any possible limitations of this
approach. They don’t all need to be solved but we should at least be
exploring them. As a quick example, how does using single delete files
interact with our commit logic? I would guess that a single delete file
approach would make it more difficult to perform multiple deletes
concurrently?

Sent from my iPad

> On Oct 8, 2023, at 9:22 PM, Renjie Liu  wrote:
> 
> Hi, Ryan:
> 
> Thanks for your reply.
> 
> 1. What is the exact file format for these on disk that you're proposing?
> Even if you're saying that it is what is produced by roaring bitmap, we
> need more information. Is that a portable format? Do you wrap it at all in
> the file to carry extra metadata? For example, the proposal says that a
> starting position for a bitmap would be used. Where is that stored?
> 
> Sorry for the confusion, by file format I mean roaring bitmap's file
> format. I checked that it has been implemented in several languages, such
> as java, go, rust, c. Metadata will be stored in the manifest file as
> other entries such as data file, deletion file. The starting position
> doesn't need to be stored since it's used by the file reader. I think your
> suggestion to provide an interface in the design will make things clearer,
> and I will add it to the design doc.
> 
> 2. How would DML operations work? Just a sketch would be great. I don't
> think it is a good idea to leave the implications for DML fuzzy.
> 
> I'll add sketches for other DML operations.
> 
> 3. The comparison appears to be between rewriting data files and using
> delete vectors. I think it needs to compare the existing delete file
> formats to delete vectors so that we know why there is a benefit to doing
> this beyond using the current positional delete files. The doc states that
> there aren't measurements here, which I think we need. Otherwise, should
> we just have a version of DML that produces one position delete per data
> file?
> 
> I think deletion vector files are quite similar to position delete files,
> e.g. you can think of a deletion vector file as one position delete per
> data file. But this change brings new chances for optimization, and there
> is one section talking about it in the design doc. As for the
> measurements, I'll try to design some experiments for it.
> 
> 4. I think this is missing some justification for how you're changing data
> file metadata.
> 
> I agree with your comment that if we associate one deletion vector with a
> data file, maybe it's better to extend the DataFile struct rather than
> introducing new entries.
> 
> I'll update the doc to address the comments.
> 
> On Mon, Oct 9, 2023 at 1:44 AM Ryan Blue  wrote:
>> Thanks, Renjie. I went through and made some comments about what is still
>> not clear. Here's a summary:
>> 
>> 1. What is the exact file format for these on disk that you're proposing?
>> Even if you're saying that it is what is produced by roaring bitmap, we
>> need more information. Is that a portable format? Do you wrap it at all
>> in the file to carry extra metadata? For example, the proposal says that
>> a starting position for a bitmap would be used. Where is that stored?
>> 
>> 2. How would DML operations work? Just a sketch would be great. I don't
>> think it is a good idea to leave the implications for DML fuzzy.
>> 
>> 3. The comparison appears to be between rewriting data files and using
>> delete vectors. I think it needs to compare the existing delete file
>> formats to delete vectors so that we know why there is a benefit to doing
>> this beyond using the current positional delete files. The doc states
>> that there aren't measurements here, which I think we need. Otherwise,
>> should we just have a version of DML that produces one position delete
>> per data file?
>> 
>> 4. I think this is missing some justification for how you're changing
>> data file metadata.
>> 
>> On Sat, Oct 7, 2023 at 4:49 AM Renji
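
Since the thread leans on roaring bitmap's portable serialization, a small
illustration of the library being referenced may help. This is a sketch only,
using the 32-bit org.roaringbitmap API (the proposal would presumably use a
64-bit variant for row positions); it is not Iceberg API:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import org.roaringbitmap.RoaringBitmap;

    // Sketch: build a deletion vector for one data file and serialize it in
    // roaring's portable format, as the proposal describes.
    public static byte[] serializeDeletionVector(int[] deletedRowPositions) throws IOException {
      RoaringBitmap vector = RoaringBitmap.bitmapOf(deletedRowPositions);
      vector.runOptimize(); // compact runs of consecutive deleted positions
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      vector.serialize(new DataOutputStream(out));
      return out.toByteArray();
    }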

Re: Behavior of dropping table with HadoopCatalog

2023-08-30 Thread russell . spitzer
There is no way to drop a Hadoop catalog table without removing the
directory, so I’m not sure what the alternative would be

Sent from my iPhone

> On Aug 29, 2023, at 10:10 PM, Manu Zhang  wrote:
> 
> Hi all,
> 
> The current behavior of dropping a table with HadoopCatalog looks
> inconsistent to me. When stored at the default location under the table
> path, metadata and data will be deleted regardless of the "purge" flag.
> When stored elsewhere, they will be left there if not "purge". Is this by
> design?
> 
> @Override
> public boolean dropTable(TableIdentifier identifier, boolean purge) {
>   if (!isValidIdentifier(identifier)) {
>     throw new NoSuchTableException("Invalid identifier: %s", identifier);
>   }
> 
>   Path tablePath = new Path(defaultWarehouseLocation(identifier));
>   TableOperations ops = newTableOps(identifier);
>   TableMetadata lastMetadata = ops.current();
>   try {
>     if (lastMetadata == null) {
>       LOG.debug("Not an iceberg table: {}", identifier);
>       return false;
>     } else {
>       if (purge) {
>         // Since the data files and the metadata files may store in different locations,
>         // so it has to call dropTableData to force delete the data file.
>         CatalogUtil.dropTableData(ops.io(), lastMetadata);
>       }
>       return fs.delete(tablePath, true /* recursive */);
>     }
>   } catch (IOException e) {
>     throw new RuntimeIOException(e, "Failed to delete file: %s", tablePath);
>   }
> }
> 
> Thanks,
> Manu


Re: [PROPOSAL] Expose Apache Iceberg Slack data for hacking/education for the community

2023-08-08 Thread Russell Spitzer
I'm +1 as long as Slack TOS are ok with it. We already have full public
archives of the mailing list and I see slack as just an extension of the
mailing list.

On Tue, Aug 8, 2023 at 4:18 PM Brian Olsen  wrote:

> Hey Iceberg Nation,
>
> I wanted to propose having the public Apache Iceberg Slack
>  chat and user data for the community
> to use as a public data source. I have a couple of specific use cases in
> mind that I would like to use it for, hence what brought me to ask about it.
>
> The main problem I want to address for the community is the lack of
> persistence of the answers we're generating in Slack. Slack is on a free
> version that only retains the last 60 days of valuable threads happening
> there. Questions are repeatedly asked, and this takes up time for everyone
> in the community to answer the same questions multiple times. If we publish
> the public chat and user data (i.e. no emails or user info outside of
> what's displayed in Slack), then we can address this in the following ways:
>
>1. We can use this as a getting started tutorial
>featuring pyIceberg is to pull this dataset into a python or SQL ecosystem
>to learn about Iceberg, but also to discover old conversations that no
>longer appear on Slack. We can also take the raw data and push it into a
>local chatbot for folks to ask questions locally, build analytics projects
>etc...
>2. For those that are less interested in building your own chatbot or
>data pipeline, once this data is available, Tabular could use it to build
>and maintain a Discourse Forum  (not to be
>confused with Discord). There are many reasons to add this on top of Slack,
>like persistence, discoverability via Google, curation and organization
>into wiki style to the point answers, and gamification, to make the goal
>that it's not just Tabular moderating this, but that the community takes
>over as they build trust similar to Stack Overflow. Of course, once we have
>the initial community working together there, we can use both Slack for
>faster messaging, and migrate specific valuable conversations to Discourse
>once it is done.
>3. Another idea, would be that we could also use the Discourse forum
>as one of the inputs to create some sort of chatbot experience, either in
>Slack or nested in the docs. This would likely outperform just directly
>training on Slack data as answers in Slack aren't verified and curated to
>the most concise form possible.
>4. The Slack and Tabular Discourse forum would be public to read, so
>this would allow for other companies in the space to build their own
>solutions.
>
>
> The idea is that we would run a daily job that would export the Slack logs
> to some public dumping ground (GitHub or something) to store this dataset.
> Again, only public data that you could see if you signed up and logged into
> Slack would be exposed.
>
> How does this sound to everyone? Let me know if you have any questions or
> other ideas!
>
> Bits
>


Re: Broken slack invite

2023-07-24 Thread Russell Spitzer
https://github.com/apache/iceberg-docs/commit/a42abbf9e7cda62ac4d94943599e840d4342d6c5
 


It was just updated but I don't think the docs have been republished yet

> On Jul 23, 2023, at 7:34 AM, Bruno Murino  wrote:
> 
> Hi,
> 
> I’m trying to access the slack workspace for Apache Iceberg but I think the 
> link is broken.
> 
> Can I be added please?
> 
> Cheers,
> 
> Bruno Murino



Re: Location of rust repo

2023-07-19 Thread Russell Spitzer
+1, If the folks working on Rust want it in the main repo I have no issues
with that but it should be their choice :)

On Wed, Jul 19, 2023 at 12:47 PM Ryan Blue  wrote:

> I don't have a strong opinion here. I'd probably lean toward having it in
> the main repo to get more eyes on the PRs, but I think it's primarily up to
> the people contributing to the project.
>
> On Wed, Jul 19, 2023 at 2:30 AM Jan Kaul 
> wrote:
>
>> Hey all,
>>
>> we just had our first sync for the rust iceberg developers and it was
>> great to talk to everyone.
>>
>> The most important point that came up was the location where the rust
>> development should take place. The two options are either to have a
>> separate "iceberg-rust" repository or to create a "rust" folder in the
>> existing apache/iceberg repository.
>>
>> The benefits of a separate repository are separate CI, simpler merging
>> of PRs and a more scalable solution if more languages are added.
>>
>> The benefits of a subfolder in the existing repository are more
>> visibility, easier coordination with the java project and more feedback
>> from the community.
>>
>> The developers currently working on the rust implementation slightly
>> favor a separate repository but would be okay with using the existing
>> repository.
>>
>>
>> It would be great if you could share your opinions on the topic. Maybe
>> this could also be a point for the community sync later today.
>>
>> Hope you're all doing well. Best wishes,
>>
>> Jan
>>
>>
>
> --
> Ryan Blue
> Tabular
>


Re: [PROPOSAL] Preparing first Apache Iceberg Summit

2023-07-19 Thread Russell Spitzer
I would love to be involved if possible. I'm a bit short on time though but
can definitely contribute async time to planning.

On Wed, Jul 19, 2023 at 9:35 AM Jean-Baptiste Onofré 
wrote:

> Hi guys,
>
> Following the previous email about Apache Iceberg Summit, please find
> a document introducing the summit organization:
>
>
> https://docs.google.com/presentation/d/1iy2-WdVQYTwJOrwi7pFYh_x9xHNuGV5lT1yK1g194To/edit?usp=sharing
>
> I'm kindly doing a Call For Action: anyone interested to help in the
> organization and participate to the committees, please let me know.
> I would like to schedule a meeting with all interested parties.
>
> Thanks !
>
> Regards
> JB
>
> On Wed, Jul 5, 2023 at 4:37 PM Jean-Baptiste Onofré 
> wrote:
> >
> > Hi everyone,
> >
> > I started a discussion on the private mailing list, and, as there are
> > no objections from the PMC members, I'm moving the thread to the dev
> > mailing list.
> >
> > I propose to organize the first Apache Iceberg Summit \o/
> >
> > For the format, I think the best option is a virtual event with a mix of:
> > 1. Dev community talks: architecture, roadmap, features, use in
> "products", ...
> > 2. User community talks: companies could present their use cases, best
> > practices, ...
> >
> > In terms of organization:
> > 1. no objection so far from the PMC members to use Apache Iceberg
> > Summit name. If it works for everyone, I will send a message to the
> > Apache Publicity & Marketing to get their OK for the event.
> >  2. create two committees:
> >   2.1. the Sponsoring Committee gathering companies/organizations
> > wanting to sponsor the event
> >   2.2. the Program Committee gathers folks from the Iceberg community
> > (PMC/committers/contributors) to select talks.
> >
> > My company (Dremio) will “host” the event - i.e., provide funding, a
> > conference platform, sponsor logistics, speaker training, slide
> > design, etc..
> >
> > In terms of dates, as CommunityOverCode Con NA will be in October, I
> > think January 2024 would work: it gives us time to organize smoothly,
> > promote the event, and not in a rush.
> >
> > I propose:
> > 1. to create the #summit channel on Iceberg Slack.
> > 2. I will share a preparation document with a plan proposal.
> >
> > Thoughts ?
> >
> > Regards
> > JB
>


Re: Slack invitation link

2023-07-17 Thread Russell Spitzer
https://iceberg.apache.org/community/#slack

Sorry, forgot the link

On Mon, Jul 17, 2023 at 2:05 PM Russell Spitzer 
wrote:

> You shouldn't need one, does this link work?
> They only allow a certain number of joins, once they hit that we have to
> add a new link. But they don't tell us when the limit is hit
>
> On Mon, Jul 17, 2023 at 1:50 PM Jacob Marble 
> wrote:
>
>> Good morning,
>>
>> I'm looking for an invitation to join the Apache Iceberg workspace on
>> Slack. However, the link referenced from the Community page indicates that
>> I need an @apache.org email address. Is this true, or can I be invited
>> as this email address?
>>
>> Thanks,
>>
>> --
>> Jacob Marble
>> 🇺🇸 🇺🇦
>>
>


Re: Slack invitation link

2023-07-17 Thread Russell Spitzer
You shouldn't need one, does this link work?
They only allow a certain number of joins, once they hit that we have to
add a new link. But they don't tell us when the limit is hit

On Mon, Jul 17, 2023 at 1:50 PM Jacob Marble 
wrote:

> Good morning,
>
> I'm looking for an invitation to join the Apache Iceberg workspace on
> Slack. However, the link referenced from the Community page indicates that
> I need an @apache.org email address. Is this true, or can I be invited as
> this email address?
>
> Thanks,
>
> --
> Jacob Marble
> 🇺🇸 🇺🇦
>


Re: Iceberg docs pull requests

2023-07-14 Thread Russell Spitzer
One of the issues is we kind of have a dual repo doc process. Most doc
changes that are versioned are made in the main oss apache repo and they
are copied over when Iceberg is released. So changes against the doc repo
are only for fixing past docs or non-versioned pages.

I'll take a quick look at the outstanding PR's but most of them seem to be
awaiting changes

On Fri, Jul 14, 2023 at 10:05 AM Zsolt Miskolczi 
wrote:

> Hey team!
>
> First of all, thank you for giving us Iceberg. I really love the concept
> of how Iceberg structures the files and I'm pretty sure it is the feature
> of storing files in big data.
>
> However, the community is pretty active about developing Iceberg, I have a
> feeling that documentation doesn't get the attention that it deserves.
>
> I saw the open pull requests in iceberg-docs
>  and I couldn't not notice
> that there are pull requests that are more than a month old and didn't
> receive any review at all.
>
> Can I draw your attention to the community about documentation?
>
> Thank you,
> Zsolt Miskolczi
>
>


Re: iceberg and s3a compatibility

2023-07-11 Thread russell . spitzer
The long story short is that Iceberg itself is a commit protocol. So you don’t 
have to configure any Hadoop commit protocols. Iceberg doesn’t use those 
methods because its metadata structure doesn’t rely on the location of data 
files as information about the state of those files. It can just write files 
directly into their final location. Only when the metadata is updated are those 
files actually live. Check the docs for intro blogs and videos.


As for configuration, Iceberg uses its own Parquet and ORC writing libraries, so 
none of those Spark properties will actually work. You don't need any of the 
Spark ones either, and the defaults are adequate for most use cases. Check the 
Iceberg documentation for more information on the properties for configuring 
Iceberg's Parquet and ORC writers.


Sent from my iPhone

> On Jul 11, 2023, at 11:59 AM, Perfect Stranger  wrote:
> 
> 
> Hello. I am currently reading this: 
> https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/committers.html
>  and learning about the s3a committers.
> 
> It's a bit confusing and it seems like you need to be an expert in order to 
> properly use these committers. Because you don't just write to an s3a path 
> and use standard spark configs, you also need to provide configs for the s3a 
> committers...
> 
> I also saw this: https://github.com/rdblue/s3committer and it says that 
> people should just use iceberg. Does that mean that with iceberg you just 
> write to an s3a path and you don't have to specify which committer 
> (partitioned, directory, magic) to use and everything works optimally? 
> 
> Does iceberg have its own committers or something? I know that s3a's staging 
> committers, for example, require big enough local storage and hdfs, while 
> s3a's magic committer doesn't... Which makes me wonder if iceberg has any 
> requirements also...
> 
> Spark also has this guide:
> https://spark.apache.org/docs/latest/cloud-integration.html
> 
> in which it recommends these settings for parquet:
> spark.hadoop.parquet.enable.summary-metadata false
> spark.sql.parquet.mergeSchema false
> spark.sql.parquet.filterPushdown true
> spark.sql.hive.metastorePartitionPruning true
> And these settings for orc:
> spark.sql.orc.filterPushdown true
> spark.sql.orc.splits.include.file.footer true
> spark.sql.orc.cache.stripe.details.size 1
> spark.sql.hive.metastorePartitionPruning true
> Should I specify these settings when using parquet or orc with iceberg in 
> spark?
> 
> Thank you.
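
To make the "no committer configuration" point concrete, here is a minimal
sketch of an Iceberg catalog on S3A, assuming placeholder bucket and catalog
names; only catalog settings are needed, since the Iceberg metadata commit is
the atomic step:

    import org.apache.spark.sql.SparkSession;

    // Sketch: no S3A committer settings; the atomic operation is the
    // Iceberg metadata swap, not a file/directory rename.
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-on-s3a")
        .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.demo.type", "hadoop")
        .config("spark.sql.catalog.demo.warehouse", "s3a://my-bucket/warehouse")
        .getOrCreate();

    // Writer tuning happens through table properties, not the Spark conf:
    spark.sql("ALTER TABLE demo.db.tbl SET TBLPROPERTIES "
        + "('write.parquet.compression-codec' = 'zstd')");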


Re: Ad-hoc partition bucketing

2023-07-05 Thread russell . spitzer
We have been discussing something like this as well, either an arbitrary 
partitioning scheme or just a more extensive and customizable transform.

An example I’m interested in is a geo hash index where we store offsets on a 
large grid to denote partitions. The total offset file for the whole planet 
still only ends up being in the low megabytes while accounting for high density 
in cities and low density over oceans

Sent from my iPhone

> On Jul 4, 2023, at 8:08 AM, Joseph Allemandou  
> wrote:
> 
> 
> Hi Iceberg team,
> 
> I'm working at the WikimediaFoundation, and we started using Iceberg for some 
> of our big-data tables - we love it :)
> 
> One of the needs we'll have in the future would be to partition data using a 
> specific bucketing function.
> How complex would that be to add a new function to the ones already present 
> in the Iceberg partitioning mechanism? Is there any docs on doing that?
> Bonus points: Are there any plans to make it possible for users to reference 
> their own bucketing functions at table definition?
> 
> Many thanks for the awesome project<3
> 
> -- 
> Joseph Allemandou (joal) (he / him)
> Staff Data Engineer
> Wikimedia Foundation
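
For context, the built-in partition transforms are referenced directly in DDL
today; a sketch via Spark SQL with placeholder names follows. User-supplied
transforms like the geohash idea above are not currently pluggable and would
require a change to the table spec:

    // Sketch: built-in bucket and time transforms in a CREATE TABLE.
    spark.sql(
        "CREATE TABLE demo.db.events (id BIGINT, ts TIMESTAMP, payload STRING) "
            + "USING iceberg "
            + "PARTITIONED BY (bucket(16, id), days(ts))");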


Re: How to remove an Iceberg partition that only contains parquet files with 0 record

2023-06-30 Thread russell . spitzer
You probably will need to manually delete the file entry using the table API
from Java

Sent from my iPhone

> On Jun 30, 2023, at 6:58 AM, Pucheng Yang  wrote:
> 
> Hi Manu, the table has already been migrated to Iceberg and I think your
> command is only available for Hive tables. It seems it won't help my case.
> Appreciate your response!
> 
> On Thu, Jun 29, 2023 at 11:38 PM Manu Zhang  wrote:
>> You may try the following SQL, which is supported by Spark:
>> 
>> alter table identifier drop partition(partition_col_name=partition_col_value)
>> 
>> Pucheng Yang 于2023年6月30日 周五11:13写道:
>>> Iceberg version: 1.3.0
>>> Spark version: 3.2.1
>>> 
>>> Hi community,
>>> 
>>> I have an interesting situation where I migrated a Hive table to Iceberg
>>> and this original Hive table has a partition containing parquet files
>>> without any record. The delete statement can not get rid of this
>>> partition, any suggestion on how to deal with this? Thanks!
>>> 
>>> Best,
>>> Pucheng
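
A minimal sketch of that table-API route, assuming an initialized Catalog
`catalog` is in scope; the file path is a placeholder for the empty parquet
file's location:

    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.TableIdentifier;

    // Sketch: drop the 0-record file's entry from table metadata. This only
    // removes the metadata entry; the physical file can be cleaned up
    // separately (e.g. by remove_orphan_files).
    Table table = catalog.loadTable(TableIdentifier.of("db", "tbl"));
    table.newDelete()
        .deleteFile("hdfs://nn/warehouse/db/tbl/dt=2023-01-01/empty.parquet")
        .commit();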




Re: Iceberg old partition gc

2023-06-02 Thread Russell Spitzer
I think "soft-mode" is really just doing the delete. You can then recover
the snapshot if you happen to have accidentally TTL'd a partition.

On Fri, Jun 2, 2023 at 8:51 AM Szehon Ho  wrote:

> I think this violates Iceberg’s assumption of immutable snapshots.  That
> would require modifying the old snapshot to no longer point to those gc’ed
> data files, else not sure how you can time-travel to read from that
> snapshot, if some of its files are deleted?
>
> That being said, I also had this thought at some point, to keep snapshot
> info around longer.  I expect most organizations operate in a mode where
> they expire snapshots after a few days, and reasonably expect any
> time-travel or snapshot-related operation (like CDC) to happen within this
> timeframe.   And of course, use tags to keep the snapshot from expiration.
>
> But there are some use-cases where keeping more snapshot metadata for a
> period longer than when it could be read could be interesting.  For
> example, if I want to know info about the snapshot that added each data
> file, we probably have lost most of those snapshot metadata as they were
> added long ago.  Example, the frequent ask to find each partition's last
> modified time, (in an earlier email thread).
>
> I haven't thought it completely through, but it crossed my mind that a
> ‘Soft’-mode of ExpireSnapshot may be useful, where we can delete data files
> but just mark snapshot’s metadata files as expired without physically
> deleting them, and so retain the ability to answer these questions.  It
> could be done by adding ‘expired-snapshots’ list to metadata.json.  That
> being said, its a singular use case and not sure if anyone also has
> interest or other use-case?  It would add a bit of complexity.
>
> Thanks
> Szehon
> Szehon
>
> On Fri, Jun 2, 2023 at 7:12 AM Pucheng Yang 
> wrote:
>
>> Ryan,
>>
>> One use case is the user might need to time travel to a certain snapshot.
>> However, such a snapshot is expired due to the snapshot expiration
>> that only retains the latest snapshot operation, and this operation's only
>> intent is to remove the gc partition. It seems a little overkill to me.
>>
>> I hope my explanation makes sense to you.
>>
>> On Thu, Jun 1, 2023 at 3:39 PM Ryan Blue  wrote:
>>
>>> Pucheng,
>>>
>>> What is the use case around keeping the snapshot longer? We don't often
>>> have people ask to keep snapshots that can't be read, so it sounds like you
>>> might have something specific in mind?
>>>
>>> Ryan
>>>
>>> On Wed, May 31, 2023 at 8:19 PM Pucheng Yang 
>>> wrote:
>>>
 Hi community,

 In my organization, a big portion of the datasets are partitioned by
 date, normally we keep the latest X dates of partition for a given dataset.

 One issue that always bothers me is if I want to delete a partition
 that should be GC, I will run SQL query "delete from tbl where dt = ..."
 and do snapshot expiration to keep the latest snapshot to make sure that
 partition data is physically removed. However, the downside of this
 approach is the table snapshot history will be completely lost..

 I wonder if anyone else in the community has the same pain point? How
 do you solve this? I would love to understand if there is a solution to
 this otherwise we can brainstorm if there is a way to solve this.

 Thanks!

 Pucheng

>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>

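
For concreteness, a sketch of the delete-then-expire flow being discussed,
assuming `spark` and a loaded `table` are in scope; retainLast(1) is what
makes the partition's files physically removable, but it is also what erases
the time-travel history:

    import org.apache.iceberg.spark.actions.SparkActions;

    // Sketch: logically delete the partition, then expire snapshots so the
    // files are physically removed. Keeping only the latest snapshot also
    // discards all history, which is the pain point in this thread.
    spark.sql("DELETE FROM demo.db.tbl WHERE dt = '2023-01-01'");

    SparkActions.get(spark)
        .expireSnapshots(table)
        .expireOlderThan(System.currentTimeMillis())
        .retainLast(1)
        .execute();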

Re: How to use Apache Iceberg cli

2023-06-01 Thread russell . spitzer
Iceberg does not have its own CLI or web UI. The spark-shell (Spark CLI) or 
Trino is usually recommended for early testing. The instructions for 
interacting with Iceberg are in the other Spark doc pages where you found the 
getting started docs.

Sent from my iPhone

> On Jun 1, 2023, at 9:40 AM, Arvind Dige  wrote:
> 
> 
> Any solution on this??
> From: Arvind Dige
> Sent: Thursday, June 1, 2023 11:21 AM
> To: dev@iceberg.apache.org 
> Subject: How to use Apache Iceberg cli
>  
> Hi Team,
> 
> I am using the open source iceberg-1.2.1 for development purposes. where I 
> have successfully build the iceberg using steps mentioned in below github 
> repo https://github.com/apache/iceberg 
>  I am able to configure and use it with the spark using below steps 
> https://iceberg.apache.org/docs/latest/getting-started/ 
> But now I want to use iceberg shell and web ui but unfortunately I didn't 
> found any supporting documentation. So please help me on this.
> 
> Regards,
> Arvind 


Re: 👋 Intro and question for the community

2023-05-30 Thread Russell Spitzer
I'm not too worried. My hackles just go up when someone needs special
permissions for something I can do with my eyeballs :) I'm guessing maybe
this has something to do with the Github TOS? This does seem like
functionality Github would like to be able to control through a restricted
API I guess.

Anyway I'm fine with using this tool as long as we make it available to
everyone in the community to see. Would there be any issues with that?

On Tue, May 30, 2023 at 11:23 AM Brian Olsen 
wrote:

> Great question!
>
> I asked the same questions to Common Room and this is what they responded
> with:
>
> So with the app, we can pull deltas. With using the method with our own
>> auth, we don’t. I’m not sure if it’s a limitation in how it was written,
>> since our auth was written years ago, before we supported all the activity
>> types. But I do understand that activities like stars would have to be
>> repulled every time with our method, so we opt not to do that.
>
>
> I looked into how their competitor does it
> <https://orbit.love/docs/all/github-integration#aed6c486fcd746098531d5dda92a641c>
> and they also do the same. I'm not sure why. I imagine scraping is much
> more difficult and limited as the website is always subject to change. We
> used Orbit in the Trino community and none of the data they scraped
> required private access. So my best guess is that GitHub doesn't allow
> incremental updates and no way of scraping the site without pulling
> everything, all at once. Which is what they do when you run a proof of
> concept with them.
>
> Other Apache communities have connected with them so maybe a next step
> could be to reach out to some of those communities or I'd be happy to bring
> anyone who is interested on to a phone call with Common Room to ask any
> other quesitons.
>
> Let me know what you think.
>
>
>
> On Tue, May 30, 2023 at 10:54 AM Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> Could you please elaborate on what Common room really is and why it needs
>> special permissions? I would have thought just generic public access
>> would be enough to check PR's, Issues and such?
>>
>> On Tue, May 23, 2023 at 7:17 PM Anton Okolnychyi
>>  wrote:
>>
>>> Seems valuable to me.
>>>
>>> - Anton
>>>
>>> On May 18, 2023, at 2:44 PM, Brian Olsen 
>>> wrote:
>>>
>>> Hey all,
>>> My name is Brian and I'm the new Head of Developer Relations working at
>>> Tabular. I'd like to set up Common <https://www.commonroom.io/> Room
>>> <https://www.commonroom.io/> for us to have a bit of a pulse on the
>>> community. I would like to see if the community is interested in enabling
>>> read-only
>>> <https://docs.commonroom.io/get-started/integrations/github#required-permissions>
>>> permissions for the apache/iceberg and apache/icberg-docs for the GitHub
>>> integration. Here's how the information would be used:
>>>
>>>- Triage issues and PRs
>>>- Learn ways to improve developer/contributor experience in the
>>>community
>>>- Understand which PRs and issues are not getting attention and why
>>>- Set alerts and notifications for the Developer Relations team to
>>>follow up on issues to help drive changes in Iceberg
>>>- Metrics reporting to showcase Iceberg usage to drive further
>>>adoption and interest in Iceberg
>>>- Gaining a better understanding of the ways people use Iceberg and
>>>the features they are interested in
>>>- Showcase the diversity of contributions the Iceberg project
>>>
>>> Is everyone okay with me setting this up so I can help the community
>>> with things like roadmap updates and making sure we follow up on reviews?
>>>
>>>
>>>


Re: 👋 Intro and question for the community

2023-05-30 Thread Russell Spitzer
Could you please elaborate on what Common room really is and why it needs
special permissions? I would have thought just generic public access
would be enough to check PR's, Issues and such?

On Tue, May 23, 2023 at 7:17 PM Anton Okolnychyi
 wrote:

> Seems valuable to me.
>
> - Anton
>
> On May 18, 2023, at 2:44 PM, Brian Olsen  wrote:
>
> Hey all,
> My name is Brian and I'm the new Head of Developer Relations working at
> Tabular. I'd like to set up Common  Room
>  for us to have a bit of a pulse on the
> community. I would like to see if the community is interested in enabling
> read-only
> 
> permissions for the apache/iceberg and apache/icberg-docs for the GitHub
> integration. Here's how the information would be used:
>
>- Triage issues and PRs
>- Learn ways to improve developer/contributor experience in the
>community
>- Understand which PRs and issues are not getting attention and why
>- Set alerts and notifications for the Developer Relations team to
>follow up on issues to help drive changes in Iceberg
>- Metrics reporting to showcase Iceberg usage to drive further
>adoption and interest in Iceberg
>- Gaining a better understanding of the ways people use Iceberg and
>the features they are interested in
>- Showcase the diversity of contributions the Iceberg project
>
> Is everyone okay with me setting this up so I can help the community with
> things like roadmap updates and making sure we follow up on reviews?
>
>
>


Re: Iceberg transaction support with spark sql

2023-05-25 Thread Russell Spitzer
We also have branch and merge support, which may be a bit easier to use

> On May 25, 2023, at 4:33 PM, Anton Okolnychyi  
> wrote:
> 
> Unfortunately, Spark SQL does not have an API for transactions. However, you 
> may use Iceberg WAP to stage multiple changes and commit them as a single 
> Iceberg transaction.
> 
> - Anton
> 
>> On May 25, 2023, at 12:49 PM, Gaurav Agarwal  wrote:
>> 
>> Hi
>> 
>> We want to delete and insert rows in Iceberg in one transaction, and we are 
>> using Spark SQL to execute delete queries. Is there a way we can use or 
>> implement this in our application?
>> 
>> I also saw a DynamoDB manager somewhere in your code; will that help?
>> 
>> Thanks 
> 
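For concreteness, a minimal sketch of both approaches from Spark, assuming a
catalog named "catalog", placeholder tables db.events and db.incoming, and an
existing SparkSession "spark"; branch writes and the fast_forward /
cherrypick_snapshot procedures depend on the Iceberg release in use:

    // Branch-and-merge: stage the delete and the insert on a branch, then
    // fast-forward main so both changes become visible together.
    spark.sql("ALTER TABLE catalog.db.events CREATE BRANCH staging");
    spark.sql("DELETE FROM catalog.db.events.branch_staging WHERE day = DATE '2023-05-20'");
    spark.sql("INSERT INTO catalog.db.events.branch_staging SELECT * FROM catalog.db.incoming");
    spark.sql("CALL catalog.system.fast_forward('db.events', 'main', 'staging')");

    // Write-audit-publish (WAP): with write.wap.enabled=true and a wap id set,
    // writes are staged as snapshots that stay invisible until published,
    // e.g. with the cherrypick_snapshot procedure.
    spark.sql("ALTER TABLE catalog.db.events SET TBLPROPERTIES ('write.wap.enabled'='true')");
    spark.conf().set("spark.wap.id", "batch-2023-05-25");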



Re: Copyonwrite scan

2023-05-24 Thread russell . spitzer
Could you include the exception you are seeing?

Sent from my iPhone

> On May 23, 2023, at 9:13 PM, Gaurav Agarwal  wrote:
> 
> 
> Hi
> 
> We are getting
> " runtime file filtering exception the table has been concurrently modified 
> row level operation scan snapshot id "
> 
> We got this exception while trying to delete data from a table that has the 
> copy-on-write setting for the delete operation. I checked the code and this is 
> mentioned in SparkCopyOnWriteScan.java.
> But why are we getting this exception if only one Spark job is running that 
> tried to delete the data?
> Any info on why, or in which cases we see this exception?
> 
> Thanks in advance
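For anyone who hits a similar error: the knobs that usually matter for
copy-on-write commit conflicts are the isolation level and commit retries. A
hedged sketch (the table name is a placeholder, and this is general tuning
rather than a confirmed fix for the case above):

    // Copy-on-write deletes validate that no conflicting files were committed
    // between scan planning and the write. 'snapshot' isolation is less strict
    // than the default 'serializable', and retries let the commit re-validate.
    spark.sql("ALTER TABLE catalog.db.t SET TBLPROPERTIES ("
        + "'write.delete.mode'='copy-on-write',"
        + "'write.delete.isolation-level'='snapshot',"
        + "'commit.retry.num-retries'='10')");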


Re: Scan statistics

2023-05-22 Thread Russell Spitzer
 Here we need to keep
>>>every split in memory in the planning phase
>>>4. ??? - Do we have any other option to define the order of the
>>>files in an Iceberg plan?
>>>
>>> For a big enough dataset 1st is not an option.
>>> 2nd is also not an option in our case as we have to be prepared for
>>> downstream job failure
>>>
>>> We were able to implement the 3rd option. We created a Scan where we
>>> requested statistics for the data files, and used these statistics to
>>> execute the read tasks in order. Since the timestamp ranges in the files
>>> were correct, we were able to use the original allowed lateness settings to
>>> read the data.
>>>
>>> The issue with the 3rd option is that we are accepting higher memory
>>> usage during planning.
>>>
>>> It turns out that for a big enough table (a few hundred columns) the
>>> biggest part of this footprint is the statistics for the
>>> columns (columnSizes, valueCounts, nullValueCounts, nanValueCounts,
>>> lowerBounds, upperBounds). They were ~100k for each GenericDataFile, where
>>> a GenericDataFile without statistics is <1k. As a test we used reflection
>>> to null out the not needed statistics (columnSizes, valueCounts,
>>> nullValueCounts, nanValueCounts, upperBounds) from the Tasks, and the JM
>>> memory usage decreased to 10 percent of what it was when we requested
>>> the full statistics for the files. If we could define the specific column
>>> which is needed, the memory usage would decrease back to the same level as
>>> was before we requested any statistics.
>>>
>>> I think this would be useful for every Iceberg user where the statistics
>>> is required for the Tasks:
>>>
>>>- Decreased memory usage
>>>- Decreased serialization cost
>>>
>>> To achieve this we would need:
>>>
>>>- Scan.java
>>>
>>>
>>> *public BatchScan includeColumnStats(Collection<String> columns) {
>>>return new BatchScanAdapter(scan.includeColumnStats(columns)); }*
>>>- ContentFile.java
>>>*F copyWithSpecificStats(Collection<String> statsToKeep);*
>>>
>>> I hope this helped to explain my point better.
>>> If you have better ideas, I would be happy to examine those.
>>>
>>> Thanks,
>>> Peter
>>>
>>> On Mon, May 15, 2023, at 18:52, Péter Váry wrote:
>>>
>>>> Thanks Ryan, Russel for the quick response!
>>>>
>>>> In our Flink job we have TumblingEventTimeWindow to filter out old data.
>>>> There was a temporary issue with accessing the Catalog, and our Flink
>>>> job was not able to read the data from the Iceberg table for a while.
>>>>
>>>> When the Flink job was able to access the Catalog again then it fetched
>>>> all the data from the table which arrived during the downtime. Since the
>>>> planning does not guarantee the order of the Tasks we ended up out of order
>>>> records which is not desirable in our case.
>>>>
>>>> We were able to fetch all of the splits (Flink does this currently) and
>>>> sort them based on the stats. Flink SplitAssigner interface allowed us to
>>>> serve the splits in the given order and this way we did not have late
>>>> events any longer (we needed to do extra work to provide the required
>>>> watermarks, but that is not that relevant here).
>>>>
>>>> If I understand correctly, the ManifestGroups are good for filtering
>>>> the plan results. Our requirement is not really filtering, but ordering the
>>>> Tasks. Is there a way to do that?
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>> On Mon, May 15, 2023, at 18:07, Ryan Blue wrote:
>>>>
>>>>> Yes, I agree with Russell. You'd want to push the filter into planning
>>>>> rather than returning stats. That's why we strip out stats when the file
>>>>> metadata is copied. It also would be expensive to copy some, but not all 
>>>>> of
>>>>> the file stats. It's better not to store the stats you don't need.
>>>>>
>>>>> What about using the ManifestGroup interface to get finer-grained
>>>>> control of the planning?
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Mon, May 15, 2023 at 8:05 AM Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> I think currently the recommendation would be to filter the iterator
>>>>>> rather than pulling the whole object with stats into memory. Is there a
>>>>>> requirement that all of the DataFiles be pulled into memory before
>>>>>> filtering?
>>>>>>
>>>>>> On Mon, May 15, 2023 at 9:49 AM Péter Váry <
>>>>>> peter.vary.apa...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Team,
>>>>>>>
>>>>>>> We have a Flink job where we would like to use the Iceberg File
>>>>>>> statistics (lowerBounds, upperBounds) during the planning phase.
>>>>>>>
>>>>>>> Currently it is possible to parameterize the Scan to include the
>>>>>>> statistics using the includeColumnStats [1]. This is an on/off switch, 
>>>>>>> but
>>>>>>> currently there is no way to configure this on a finer granularity.
>>>>>>>
>>>>>>> Sadly our table has plenty of columns and requesting statistics for
>>>>>>> every column will result in GenericDataFiles objects where the
>>>>>>> retained heap is ~100k each. We have a few thousand data files and
>>>>>>> requesting statistics for them would add serious extra memory load to 
>>>>>>> our
>>>>>>> job.
>>>>>>>
>>>>>>> I was considering adding a new method to the Scan class like this:
>>>>>>> -
>>>>>>> ThisT includeColumnStats(Collection<String> columns);
>>>>>>> -
>>>>>>>
>>>>>>> Would the community consider this as a valuable addition to the Scan
>>>>>>> API?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Peter
>>>>>>>
>>>>>>> [1]
>>>>>>> https://github.com/apache/iceberg/blob/f536c840350bd5628d7c514d2a4719404c9b8ed1/api/src/main/java/org/apache/iceberg/Scan.java#L71-L78
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>>>
>>>>
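To make the proposal concrete, a minimal sketch of the caller's side, assuming
the proposed includeColumnStats(Collection<String>) overload lands; the table
and the event-time column "ts" are hypothetical:

    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import org.apache.iceberg.FileScanTask;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.TableScan;
    import org.apache.iceberg.relocated.com.google.common.collect.Lists;
    import org.apache.iceberg.types.Conversions;
    import org.apache.iceberg.types.Types;

    // Request stats only for "ts", then order tasks by each file's lower bound
    // for that column (timestamp bounds decode to microseconds since epoch).
    static List<FileScanTask> planOrderedByEventTime(Table table) {
      TableScan scan = table.newScan()
          .includeColumnStats(Collections.singletonList("ts")); // proposed overload
      int tsId = table.schema().findField("ts").fieldId();
      List<FileScanTask> tasks = Lists.newArrayList(scan.planFiles());
      tasks.sort(Comparator.comparing((FileScanTask task) ->
          (Long) Conversions.fromByteBuffer(
              Types.TimestampType.withZone(), task.file().lowerBounds().get(tsId))));
      return tasks;
    }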


Re: Scan statistics

2023-05-15 Thread Russell Spitzer
I think currently the recommendation would be to filter the iterator rather
than pulling the whole object with stats into memory. Is there a
requirement that all of the DataFiles be pulled into memory before
filtering?

On Mon, May 15, 2023 at 9:49 AM Péter Váry 
wrote:

> Hi Team,
>
> We have a Flink job where we would like to use the Iceberg File statistics
> (lowerBounds, upperBounds) during the planning phase.
>
> Currently it is possible to parameterize the Scan to include the
> statistics using the includeColumnStats [1]. This is an on/off switch, but
> currently there is no way to configure this on a finer granularity.
>
> Sadly our table has plenty of columns and requesting statistics for every
> column will result in GenericDataFiles objects where the retained heap is
> ~100k each. We have a few thousand data files and requesting statistics for
> them would add serious extra memory load to our job.
>
> I was considering adding a new method to the Scan class like this:
> -
> ThisT includeColumnStats(Collection<String> columns);
> -
>
> Would the community consider this as a valuable addition to the Scan API?
>
> Thanks,
> Peter
>
> [1]
> https://github.com/apache/iceberg/blob/f536c840350bd5628d7c514d2a4719404c9b8ed1/api/src/main/java/org/apache/iceberg/Scan.java#L71-L78
>


Re: Slack invitation

2023-05-10 Thread Russell Spitzer
Cool, I'll update the link ASAP.

https://join.slack.com/t/apache-iceberg/shared_invite/zt-1uva9gyp1-TrLQl7o~nZ5PsTVgl6uoEQ

New Link ^

> On May 10, 2023, at 12:53 PM, Gaurav Agarwal  wrote:
> 
> I tried that; it's not working for me. I attached screenshots below to show 
> I'm not doing anything wrong.
> 
> 
> When I tried to log in with Google, it says: [screenshot not preserved]
> 
> 
> On Wed, May 10, 2023 at 9:50 PM Russell Spitzer  <mailto:russell.spit...@gmail.com>> wrote:
> The docs say this:
> 
> Slack <https://iceberg.apache.org/community/#slack>
> We use the Apache Iceberg workspace <https://apache-iceberg.slack.com/> on 
> Slack. To be invited, follow this invite link 
> <https://join.slack.com/t/apache-iceberg/shared_invite/zt-1oj35f7yc-wuTEhvkiqjGLje83B7rG8A>.
> 
> Please note that this link may occasionally break when Slack does an upgrade. 
> If you encounter problems using it, please let us know by sending an email to 
> dev@iceberg.apache.org <mailto:dev@iceberg.apache.org>.
> 
> 
> 
> The link provided there lets up to 400 people self-invite; once it stops 
> working we just have to add a new link. So that's what I'm asking: do we need 
> to update the link, or does it work?
> 
>> On May 10, 2023, at 11:16 AM, Gaurav Agarwal > <mailto:gaurav130...@gmail.com>> wrote:
>> 
>> There it also says that if you don't have an Apache email ID, you need to get an invitation: 
>> Don’t have an @apache.org <http://apache.org/> email address?
>> Contact the workspace administrator at apache-iceberg for an invitation.
>> 
>> On Wed, May 10, 2023 at 9:13 PM Russell Spitzer > <mailto:russell.spit...@gmail.com>> wrote:
>> Does this link no longer work?
>> 
>> https://iceberg.apache.org/community/#slack 
>> <https://iceberg.apache.org/community/#slack>
>> 
>> > On May 10, 2023, at 12:58 AM, Thijs van de Poll 
>> > mailto:thijsvandep...@getfocus.eu>> wrote:
>> > 
>> > Hi, 
>> > 
>> > I would like to get access to the Apache Iceberg community on Slack if 
>> > that would be possible. Would you mind sending an invitation to this email 
>> > address?
>> > 
>> > Best regards,
>> > 
>> > Thijs van de Poll
>> 
> 



Re: Slack invitation

2023-05-10 Thread Russell Spitzer
The docs say this:

Slack <https://iceberg.apache.org/community/#slack>
We use the Apache Iceberg workspace <https://apache-iceberg.slack.com/> on 
Slack. To be invited, follow this invite link 
<https://join.slack.com/t/apache-iceberg/shared_invite/zt-1oj35f7yc-wuTEhvkiqjGLje83B7rG8A>.

Please note that this link may occasionally break when Slack does an upgrade. 
If you encounter problems using it, please let us know by sending an email to 
dev@iceberg.apache.org <mailto:dev@iceberg.apache.org>.



The link provided there lets up to 400 people self-invite; once it stops 
working we just have to add a new link. So that's what I'm asking: do we need 
to update the link, or does it work?

> On May 10, 2023, at 11:16 AM, Gaurav Agarwal  wrote:
> 
> There it also says that if you don't have an Apache email ID, you need to get an invitation: 
> Don’t have an @apache.org <http://apache.org/> email address?
> Contact the workspace administrator at apache-iceberg for an invitation.
> 
> On Wed, May 10, 2023 at 9:13 PM Russell Spitzer  <mailto:russell.spit...@gmail.com>> wrote:
> Does this link no longer work?
> 
> https://iceberg.apache.org/community/#slack 
> <https://iceberg.apache.org/community/#slack>
> 
> > On May 10, 2023, at 12:58 AM, Thijs van de Poll  > <mailto:thijsvandep...@getfocus.eu>> wrote:
> > 
> > Hi, 
> > 
> > I would like to get access to the Apache Iceberg community on Slack if that 
> > would be possible. Would you mind sending an invitation to this email 
> > address?
> > 
> > Best regards,
> > 
> > Thijs van de Poll
> 



Re: Slack invitation

2023-05-10 Thread Russell Spitzer
Does this link no longer work?

https://iceberg.apache.org/community/#slack

> On May 10, 2023, at 12:58 AM, Thijs van de Poll  
> wrote:
> 
> Hi, 
> 
> I would like to get access to the Apache Iceberg community on Slack if that 
> would be possible. Would you mind sending an invitation to this email address?
> 
> Best regards,
> 
> Thijs van de Poll



Re: Support create table like for Iceberg table?

2023-05-09 Thread Russell Spitzer
How would CREATE TABLE LIKE be different from our "snapshot" procedure,
just enabled for Iceberg tables? Wondering if we should just expand that
functionality.

On Tue, May 9, 2023 at 11:54 AM Pucheng Yang 
wrote:

> Ryan, when I mentioned "copy of the data", I didn't mean to
> physically copy the data. I meant a copy of the metadata and configuration
> such that the created table can also read the data that belongs to the
> table we created from. However, I do share the concern that CREATE TABLE
> LIKE, if we plan to follow what most systems do, will copy some important
> configuration (such as gc.enabled) that I think we definitely don't want
> since it will create a surface for people to mess up the original table. In
> this regard, I agree we should adopt the approach of having a procedure
> instead. So I am dropping this CREATE TABLE LIKE feature request.
>
> Anton, branching will work but I will still prefer creating a separate
> table for these reasons: (1) I considered "branching" as a very advanced/
> new feature to my customers and it is generally easy and safe to just let
> them use a separate test table. (2) the new generated data will be placed
> under a separate location making auditing and clean up easier. (3) if we
> use branching, there is coordination between the user who is doing testing
> via branching and the platform, which is constantly performing table
> maintenance, thus introducing friction.
>
> On Thu, Apr 27, 2023 at 2:15 PM Anton Okolnychyi
>  wrote:
>
>> Iceberg supports branching so that you can safely perform such tests
>> without any risk of corrupting the table. No need to create a separate
>> table and clone the config. Overall, I don’t think it is a good idea to
>> break the contract of CREATE TABLE LIKE.
>>
>> - Anton
>>
>> On Apr 27, 2023, at 11:59 AM, Pucheng Yang 
>> wrote:
>>
>> Hi Anton,
>>
>> Yes, I want to branch the table state and reuse the data files, but for
>> test purposes only. Imagine if we want to test something related to reading
>> the Iceberg table or perform row level update.
>>
>> And I acknowledge the potential risk of the table state being corrupted.
>> So I am thinking we can consider adding these limitations when running the
>> "create table like":
>> (1) the created table should have "snapshot=true"
>> (2) the created table should have "gc.enabled=false" to make sure
>> existing files don't get messed up
>> (3) the created table should have a table location different then the
>> existing Iceberg table location it creates from
>> We can consider "create table like" as a snapshot action for an existing
>> Iceberg table, similar to the existing snapshot procedure we have for an
>> existing Hive table.
>>
>> I know CREATE TABLE LIKE is supposed to copy/reuse the existing table
>> definition only. If we have concerns around messing up table state, I wish
>> we can break it down into the implementation and at least first implement
>> the part where we create tables without reusing the existing data files.
>>
>> On Wed, Apr 26, 2023 at 8:26 AM Anton Okolnychyi <
>> aokolnyc...@apple.com.invalid> wrote:
>>
>>> Pucheng, you mentioned you want to reuse existing data in the new table?
>>> Branching Iceberg table state can lead to unexpected situations as there
>>> will be multiple pointers in the catalog to the same state, which can
>>> eventually corrupt the table. Isn’t CREATE TABLE LIKE supposed to just
>>> reuse the existing table definition without copying the data?
>>>
>>> - Anton
>>>
>>> On Apr 26, 2023, at 5:41 AM, Zoltán Borók-Nagy 
>>> wrote:
>>>
>>> As a reference, Impala can also do Hive-style CREATE TABLE x LIKE y for
>>> Iceberg tables.
>>> You can see various examples at
>>> https://github.com/apache/impala/blob/master/testdata/workloads/functional-query/queries/QueryTest/iceberg-create-table-like-table.test
>>>
>>> - Zoltan
>>>
>>> On Wed, Apr 26, 2023 at 4:10 AM Ryan Blue  wrote:
>>>
 You should be able to see how other DSv2 commands are written and copy
them. Look at Drop Table, maybe, and see if you can copy the structure, but
 instead of dropping, load the table and call createTable with its metadata.

 On Tue, Apr 25, 2023 at 4:42 PM Pucheng Yang <
 py...@pinterest.com.invalid> wrote:

> Thanks Steve and Ryan for the reply.
>
> Steve, I am not looking for CTAS, my goal is to create an Iceberg
> table and reuse the existing data (same as the create table like statement
> above). Also my question is not about specifying location in
> create statement.
>
> Ryan, the engine we are interested in is SparkSQL. Since you mentioned
> it is an easy fix, would you please share how that should be implemented
> such that anyone (maybe myself) interested in this can explore the 
> solution?
>
> Thanks both again.
>
> On Tue, Apr 25, 2023 at 4:07 PM Ryan Blue  wrote:
>
>> Pucheng, what engine are you interested in?
>>
>> This works fine in Trino: CR
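For readers landing here: the existing snapshot procedure Russell mentioned can
already register an independently named test table over another table's current
data files without copying them (and, as far as I know, it sets
gc.enabled=false on the new table so the source files stay safe). A sketch with
placeholder names, assuming an existing SparkSession "spark":

    spark.sql("CALL catalog.system.snapshot("
        + "source_table => 'db.events', "
        + "table => 'db.events_test', "
        + "location => 's3://bucket/tmp/events_test')");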

Re: Welcome new committers and PMC!

2023-05-03 Thread Russell Spitzer
Great news! It's so exciting to have the project continue to grow!

> On May 3, 2023, at 2:06 PM, Ryan Blue  wrote:
> 
> Hi everyone,
> 
> I want to congratulate Amogh and Eduard, who were just added as Iceberg 
> committers and Szehon, who was just added to the PMC. Thanks for all your 
> contributions!
> 
> Ryan
> 
> -- 
> Ryan Blue



Re: Why is sort required for Spark writing to partitioned table

2023-04-25 Thread Russell Spitzer
https://github.com/apache/iceberg/issues/7037

On Tue, Apr 25, 2023 at 1:52 PM Pucheng Yang 
wrote:

> Great, thanks. It would be great if we could update the doc to avoid confusion.
>
> On Tue, Apr 25, 2023 at 11:47 AM Anton Okolnychyi
>  wrote:
>
>> We have implemented this natively in Spark and explicit sorts are no
>> longer required. Iceberg takes into account both the partition and sort key
>> in the table to request a distribution and ordering from Spark. Should be
>> supported both for batch and micro-batch writes.
>>
>> - Anton
>>
>> On Apr 25, 2023, at 11:05 AM, Pucheng Yang 
>> wrote:
>>
>> Hi to confirm,
>>
>> In the doc,
>> https://iceberg.apache.org/docs/1.0.0/spark-writes/#writing-to-partitioned-tables,
>> it says "Explicit sort is necessary because Spark doesn’t allow Iceberg to
>> request a sort before writing as of Spark 3.0. SPARK-23889
>> <https://issues.apache.org/jira/browse/SPARK-23889> is filed to enable
>> Iceberg to require specific distribution & sort order to Spark."
>>
>> I found that all relevant JIRAs in SPARK-23889
>> <https://issues.apache.org/jira/browse/SPARK-23889> are resolved in
>> spark-3.2.0. Does that mean we don't need an explicit sort anymore from
>> spark-3.2.0 and after?
>>
>> Thanks
>>
>> On Tue, Mar 7, 2023 at 8:10 PM Russell Spitzer 
>> wrote:
>>
>>> This is no longer accurate, since now we do have a "fan-out" writer for
>>> spark. But originally the idea here is that it is way more efficient to
>>> open a single file handle at a time and write to it, than to open a new
>>> file handle for every file as we find a new partition to write to in the
>>> same spark task. Fanout performs the write as just opening each handle as
>>> the writer sees a new partition.
>>>
>>> Now that said, this is a required local sort for the default writer. For
>>> the best performance, though, in making as few files as possible, write
>>> distribution mode "hash" will force a real shuffle but eliminate this issue
>>> by making sure each Spark task is writing to a single partition or a single
>>> set of partitions, in order. We need to update this document to talk about
>>> distribution modes, especially since hash will be the new default soon and
>>> this information is basically for manual tuning only.
>>>
>>> If your data is already organized the way you want, setting distribution
>>> mode to none will avoid this shuffle. If you don't care about multiple file
>>> handles being open at the same time, you can set the fanout writer option.
>>> With "none" and "fan-out" writers you will basically write in the fastest
>>> way possible at the expense of memory at write time and possibly generating
>>> many files if your data isn't organized.
>>>
>>> On Tue, Mar 7, 2023 at 9:46 PM Manu Zhang 
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> As per
>>>> https://iceberg.apache.org/docs/latest/spark-writes/#writing-to-partitioned-tables,
>>>> sort is required for Spark writing to a partitioned table. Does anyone know
>>>> the reason behind it? If this is to avoid creating too many small files,
>>>> isn't shuffle/repartition sufficient?
>>>>
>>>> Thanks,
>>>> Manu
>>>>
>>>>
>>
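A short sketch of the tuning options described above, assuming an existing
SparkSession "spark", a Dataset<Row> "df", and placeholder table names:

    // Ask Iceberg to request a hash distribution from Spark so each task
    // writes one partition (or a small set) in order; this forces a shuffle.
    spark.sql("ALTER TABLE catalog.db.events SET TBLPROPERTIES ('write.distribution-mode'='hash')");

    // Or skip both the shuffle and the local sort, paying with one open file
    // handle per partition seen by the task (more memory, possibly more files).
    df.writeTo("catalog.db.events").option("fanout-enabled", "true").append();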


Re: [DISCUSS] Spark 3.1 support?

2023-04-22 Thread russell . spitzer
If you are already back-porting patches to a different branch which isn’t
receiving other fixes anyway, why does it help to keep 3.1 support in master?
If we kept it there, we would be doing so for users who want a 1.2+ Iceberg
version with 3.1 support. That doesn’t sound like your use case. I’m not
against doing so; it just doesn’t seem like this would really benefit you. You
would just end up backporting twice: once to master, then again from that
branch back to 0.13.

Sent from my iPhone

On Apr 23, 2023, at 12:07 AM, Manu Zhang wrote:

Hi Russell,
What do you mean by "keep these changes in master"? Can you elaborate?
As for Iceberg, we back-port spark/v3.1 patches from the master branch.

On Sun, Apr 23, 2023 at 10:04 AM  wrote:

If you are on forked 0.13 is it important to keep these changes in master?
Sent from my iPhone

On Apr 22, 2023, at 8:42 PM, Manu Zhang wrote:

I'd like to share our maintenance strategy and history at eBay.

We are now on forked versions of Iceberg 0.13.1 and Spark 3.1.1. For Spark, we
started to evaluate upgrading to 3.1.1 from 2.3/2.4 in H2 2021, since it was
the latest and most stable version then. After migrating internal changes and
finishing tests, we rolled out to customers for our managed platforms (mainly
SQL) or pushed them to upgrade on their own (mainly Scala and PySpark). At this
time, there are still less than 10% of customers that haven't upgraded. It's
unlikely we will make another major upgrade soon. We've been back-porting bug
fixes from Spark branch-3.1, but now we are on our own. For a company the size
of eBay, I don't think it's unusual to spend more than 18 months on such a
major upgrade. The 18-month maintenance period is too short, in my opinion.
(BTW, Spark 3.2 just made its final release.)

The benefit of a community-maintained branch is that we can always be notified
of critical bug fixes and fix them proactively before they impact our
customers. Can we at least open GitHub issues for back-porting bug fixes and
see whoever cares step up? I'm more than willing to do it. If after some time
no one wants to pick up the back-port tasks, maybe we can eventually announce
it EOL. WDYT?

Thanks,
Manu

On Sun, Apr 23, 2023 at 3:43 AM Ryan Blue wrote:

+1 for marking 3.1 deprecated.

On Sat, Apr 22, 2023 at 10:20 AM Jack Ye wrote:

Here was the original lifecycle of engine version support guideline we came up
with:
https://iceberg.apache.org/multi-engine-support/#current-engine-version-lifecycle-status

I think we can at least mark 3.1 support as deprecated, which matches the
situation here that "People who are still interested in the version can
backport any necessary feature or bug fix from newer versions, but the
community will not spend effort in achieving feature parity." But we could keep
it around for some more time given there is still active usage of it.

Jack

On Fri, Apr 21, 2023 at 5:32 PM Steven Wu wrote:

> without requiring authors to cherry-pick all applicable changes, like we
> agreed initially.

Not trying to change what was agreed before, just for my understanding. Let's
say the latest Spark version is 3.3. Today, we don't require any backport to
3.2 and 3.1, correct?

On Fri, Apr 21, 2023 at 5:19 PM Ryan Blue wrote:

I still agree with the idea that people interested in Spark 3.1 should be
primarily responsible for keeping it updated. Backporting patches is up to the
contributor.

The only concern I have about keeping Hive 3.1 is whether there are important
bugs or security issues that are not getting backported. That would signal
that the branch is not maintained enough to continue releasing it. But if we
are still seeing important problems getting fixed, I think it should be
primarily up to the people maintaining the branch.

On Fri, Apr 21, 2023 at 5:14 PM Anton Okolnychyi wrote:

We backported only a small number of changes to 3.1, compared to 3.2. At this
point, they also diverged quite a bit, so doing those backports is hard. When
we discussed how to support multiple engine versions, the community initially
agreed that it's optional for authors to cherry-pick changes into older
versions and that this should be done by other members of the community
interested in those integrations. That's what led us to where we are today. We
may reconsider this approach, but only if there is a small number of versions
to support. I am also OK to keep older modules, but only to provide folks a
place to collaborate, without requiring authors to cherry-pick all applicable
changes, like we agreed initially.

- Anton

On Apr 21, 2023, at 3:58 PM, Ryan Blue wrote:

Good question about backports. Walaa and Edgar, are you backporting fixes to
3.1? It makes sense to have a place to collaborate, but only if people are
actively keeping them updated.

On Fri, Apr 21, 2023 at 3:54 PM Steven Wu wrote:

For the 3.1 activities that Ryan linked, 3.1 are updated probably fo

Re: [DISCUSS] Spark 3.1 support?

2023-04-22 Thread russell . spitzer
If you are on forked 0.13 is it important to keep these changes in master?

Sent from my iPhone

On Apr 22, 2023, at 8:42 PM, Manu Zhang wrote:

I'd like to share our maintenance strategy and history at eBay.

We are now on forked versions of Iceberg 0.13.1 and Spark 3.1.1. For Spark, we
started to evaluate upgrading to 3.1.1 from 2.3/2.4 in H2 2021, since it was
the latest and most stable version then. After migrating internal changes and
finishing tests, we rolled out to customers for our managed platforms (mainly
SQL) or pushed them to upgrade on their own (mainly Scala and PySpark). At this
time, there are still less than 10% of customers that haven't upgraded. It's
unlikely we will make another major upgrade soon. We've been back-porting bug
fixes from Spark branch-3.1, but now we are on our own. For a company the size
of eBay, I don't think it's unusual to spend more than 18 months on such a
major upgrade. The 18-month maintenance period is too short, in my opinion.
(BTW, Spark 3.2 just made its final release.)

The benefit of a community-maintained branch is that we can always be notified
of critical bug fixes and fix them proactively before they impact our
customers. Can we at least open GitHub issues for back-porting bug fixes and
see whoever cares step up? I'm more than willing to do it. If after some time
no one wants to pick up the back-port tasks, maybe we can eventually announce
it EOL. WDYT?

Thanks,
Manu

On Sun, Apr 23, 2023 at 3:43 AM Ryan Blue wrote:

+1 for marking 3.1 deprecated.

On Sat, Apr 22, 2023 at 10:20 AM Jack Ye wrote:

Here was the original lifecycle of engine version support guideline we came up
with:
https://iceberg.apache.org/multi-engine-support/#current-engine-version-lifecycle-status

I think we can at least mark 3.1 support as deprecated, which matches the
situation here that "People who are still interested in the version can
backport any necessary feature or bug fix from newer versions, but the
community will not spend effort in achieving feature parity." But we could keep
it around for some more time given there is still active usage of it.

Jack

On Fri, Apr 21, 2023 at 5:32 PM Steven Wu wrote:

> without requiring authors to cherry-pick all applicable changes, like we
> agreed initially.

Not trying to change what was agreed before, just for my understanding. Let's
say the latest Spark version is 3.3. Today, we don't require any backport to
3.2 and 3.1, correct?

On Fri, Apr 21, 2023 at 5:19 PM Ryan Blue wrote:

I still agree with the idea that people interested in Spark 3.1 should be
primarily responsible for keeping it updated. Backporting patches is up to the
contributor.

The only concern I have about keeping Hive 3.1 is whether there are important
bugs or security issues that are not getting backported. That would signal
that the branch is not maintained enough to continue releasing it. But if we
are still seeing important problems getting fixed, I think it should be
primarily up to the people maintaining the branch.

On Fri, Apr 21, 2023 at 5:14 PM Anton Okolnychyi wrote:

We backported only a small number of changes to 3.1, compared to 3.2. At this
point, they also diverged quite a bit, so doing those backports is hard. When
we discussed how to support multiple engine versions, the community initially
agreed that it's optional for authors to cherry-pick changes into older
versions and that this should be done by other members of the community
interested in those integrations. That's what led us to where we are today. We
may reconsider this approach, but only if there is a small number of versions
to support. I am also OK to keep older modules, but only to provide folks a
place to collaborate, without requiring authors to cherry-pick all applicable
changes, like we agreed initially.

- Anton

On Apr 21, 2023, at 3:58 PM, Ryan Blue wrote:

Good question about backports. Walaa and Edgar, are you backporting fixes to
3.1? It makes sense to have a place to collaborate, but only if people are
actively keeping them updated.

On Fri, Apr 21, 2023 at 3:54 PM Steven Wu wrote:

For the 3.1 activities that Ryan linked, 3.1 are updated probably for the
requirement of backporting (keeping 3.1, 3.2, 3.3 in sync). It is the adopted
policy. Not sure if it is an indication that people are actively collaborating
on 3.1. As Anton was saying, backporting/syncing 4 versions (3.1, 3.2, 3.3,
3.4) is a pretty high burden.

On Fri, Apr 21, 2023 at 2:29 PM Anton Okolnychyi wrote:

If it is being used by folks in the community, let's keep it for now. That
said, let's come up with a strategy on when to eventually drop it, as the list
cannot grow indefinitely. Our initial agreement was to keep the last 3 (except
Spark LTS versions), which worked well for the 18 months of support promised by
the Spark community. At this point, Spark will not release any bug fixes for
3.1, even critical.

Walaa, Edgar, can you tel
