Re: Seeking guidance - Apache Iceberg

Russell Spitzer Tue, 31 Mar 2026 06:20:23 -0700

Yes of course, what do I need to do?

On Tue, Mar 31, 2026 at 7:35 AM Varun Lakhyani <[email protected]>
wrote:


> Hello Russell
>
> Today's the deadline for submitting proposals and I have mine ready, but I
> just learned that all proposals must have accepted mentors before 1st April
> (tomorrow).
> Can I mention you as a potential mentor? If possible, I would also need you
> to register as a mentor for the ASF organization and approve my proposal.
> I have applied using email id: [email protected]
> Attached is my proposal that I will be submitting.
>
> Thanks, and apologies for the last-minute request
>
> On Sat, Mar 21, 2026 at 1:08 PM Varun Lakhyani <[email protected]>
> wrote:
>
>> Thanks a lot Russell,
>> The voting thread got a good response.
>> I am already working on the final proposal to submit and will share it soon.
>>
>> On Sat, Mar 21, 2026 at 1:48 AM Russell Spitzer <
>> [email protected]> wrote:
>>
>>> That seems about right (I actually thought it might be even worse). I
>>> will try my best to get to that thread, but it's a very busy week for me as
>>> Iceberg Summit is just 1 work week away
>>>
>>> On Fri, Mar 20, 2026 at 2:21 PM Varun Lakhyani <
>>> [email protected]> wrote:
>>>
>>>> Hi Russell,
>>>>
>>>> I benchmarked it against AWS S3 as source and destination to get
>>>> natural IO overhead for cloud instead of manually adding it.
>>>> AWS S3 (1000 files - 14.6 Kb each):
>>>>
>>>>    - Sync time: 219.694 s
>>>>    - Async time: 51.853 s
>>>>    - Improvement: 76.4%
>>>>
>>>> I think this might help you to get a better overview.
>>>> I would really appreciate any feedback on
>>>> https://lists.apache.org/thread/rvbwmcbrlr3syd1movflw3vmprm27nmz
>>>>
>>>> On Wed, Mar 18, 2026 at 12:39 AM Varun Lakhyani <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Russell,
>>>>>
>>>>> Thanks again for the discussion and feedback during the Spark sync
>>>>> call.
>>>>> I have raised a DISCUSS thread on the dev mailing list for formal GSoC
>>>>> idea vetting for the Spark readers parallel execution work. I would really
>>>>> appreciate it if you could take a look when you get time and share any
>>>>> feedback.
>>>>>
>>>>> Vetting Discussion thread:
>>>>> https://lists.apache.org/thread/rvbwmcbrlr3syd1movflw3vmprm27nmz
>>>>>
>>>>> Next, I will check Comet's reader code path. After that, I am thinking of
>>>>> going through the parallel iterable in the Iceberg codebase and making
>>>>> any changes required for this use case.
>>>>> Thanks
>>>>>
>>>>> On Wed, Feb 25, 2026 at 12:21 AM Russell Spitzer <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> You can always ping me, but you should keep up with the dev mailing
>>>>>> list thread and add an item to the Spark Iceberg Community meetup. You
>>>>>> should be able to find it on the dev calendar
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 24, 2026 at 11:19 AM Varun Lakhyani <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hello Russell,
>>>>>>>
>>>>>>> Apologies for contacting you again personally. Please take a look at
>>>>>>> this whenever you are available.
>>>>>>> I have completed a high-level design/POC up to a certain level, along
>>>>>>> with some benchmarking, by creating a benchmark file similar to those of
>>>>>>> other Iceberg services. (The benchmarking numbers seem good to me as of now.)
>>>>>>>
>>>>>>> Could you please take a look at these:
>>>>>>> PR with code changes <https://github.com/apache/iceberg/pull/15341>
>>>>>>>  | Issue raised <https://github.com/apache/iceberg/issues/15287> |
>>>>>>> Reference documents showing detailed approach and benchmarks
>>>>>>> <https://docs.google.com/document/d/17vBz5t-gSDdmB0S40MYRceyvmcBSzw9Gii-FcU97Lds/edit?usp=sharing>
>>>>>>>
>>>>>>>
>>>>>>> Could you give me further direction on this? I am keeping the dev
>>>>>>> mailing thread updated
>>>>>>> <https://lists.apache.org/thread/b5jrlyv61lmw867kksw05sot2tro5ybn>
>>>>>>> with these, but I would need a formal vetting of this idea to get my
>>>>>>> proposal considered.
>>>>>>> Roughly what would be an appropriate date for me to post for
>>>>>>> vetting? Proposal submission starts on 16th March.
>>>>>>>
>>>>>>> I would be happy to include anything else or try out any different
>>>>>>> approach or include any suggestions that you might have.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> On Thu, Feb 12, 2026 at 5:21 AM Russell Spitzer <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Looks good to me
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Feb 11, 2026 at 8:43 AM Varun Lakhyani <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Whenever you get a chance, could you please take a look at this?
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>>
>>>>>>>>> On Wed, Feb 11, 2026 at 2:00 AM Varun Lakhyani <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello Russell,
>>>>>>>>>> I reviewed both the ideas in detail and I think I'll be able to
>>>>>>>>>> work on the 2nd task: making Spark readers run tasks in parallel.
>>>>>>>>>>
>>>>>>>>>> I went through BaseReader.java, which seems to be
>>>>>>>>>> foundational for all readers; each of them has its own
>>>>>>>>>> implementation, specifically of the open() function.
>>>>>>>>>> I will put together a formal proposal, including estimated
>>>>>>>>>> designs and task-time distribution, by 25th February; the official
>>>>>>>>>> proposal submission window is 16th March - 31st March.
>>>>>>>>>> As of now, I have raised this issue/feature request on GitHub and
>>>>>>>>>> I am thinking of raising this discussion on the Iceberg dev mailing
>>>>>>>>>> list to get it vetted, as it has to be approved on the project's
>>>>>>>>>> mailing list for ASF to consider the proposal.
>>>>>>>>>>
>>>>>>>>>> I have attached a draft of the discussion that I will raise.
>>>>>>>>>> Please let me know your thoughts and if you think there should be any
>>>>>>>>>> changes.
>>>>>>>>>> Also, can I contact you for any specific clarifications or
>>>>>>>>>> anything bothering me related to this project, or is there another
>>>>>>>>>> appropriate point of contact for this?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 10, 2026 at 4:11 AM Russell Spitzer <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> I definitely think this is medium (if not smaller). I don't
>>>>>>>>>>> think we have any problem with that
>>>>>>>>>>>
>>>>>>>>>>> Another task idea, and this is probably more of a medium to large
>>>>>>>>>>>
>>>>>>>>>>> Make our Spark readers optionally function asynchronously for
>>>>>>>>>>> tasks with many small files. The general thought there is that when
>>>>>>>>>>> we have, say, 1000 4kb files, we currently open them one at a time in
>>>>>>>>>>> order. This is slow and bad. We should instead try to open some number
>>>>>>>>>>> of those data files in parallel, then stitch them together into a
>>>>>>>>>>> buffer or iterator for downstream processing. This is a bit of a bigger
>>>>>>>>>>> refactor but would dramatically help a lot of use cases in Iceberg,
>>>>>>>>>>> including cleanup of small files in compaction.
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Feb 9, 2026 at 3:46 PM Varun Lakhyani <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, Russell, I am very much interested in working on tasks or
>>>>>>>>>>>> ideas that the community already wants, especially if reviewers
>>>>>>>>>>>> can be pre-identified.
>>>>>>>>>>>>
>>>>>>>>>>>> I looked into the Spark defaults issue you mentioned,
>>>>>>>>>>>> specifically
>>>>>>>>>>>> org/apache/iceberg/spark/sql/TestSparkDefaultValues.java,
>>>>>>>>>>>> including testCreateTableWithDefaultsUnsupported() and
>>>>>>>>>>>> testAlterTableAddColumnWithDefaultUnsupported().
>>>>>>>>>>>> From my initial analysis:
>>>>>>>>>>>>
>>>>>>>>>>>>    - The *ALTER TABLE* path passes Spark’s validation stage
>>>>>>>>>>>>    and fails at the Iceberg layer. This seems addressable by
>>>>>>>>>>>>    converting the Spark literal into an Iceberg literal for the
>>>>>>>>>>>>    data types Iceberg supports.
>>>>>>>>>>>>    - The *CREATE TABLE* path fails earlier during Spark
>>>>>>>>>>>>    analysis. This appears to be due to Spark catalog capability
>>>>>>>>>>>>    checks, and declaring ACCEPT_ANY_SCHEMA for the Iceberg catalog
>>>>>>>>>>>>    should allow defaults to pass Spark validation, after which
>>>>>>>>>>>>    similar Spark to Iceberg literal handling can be applied during
>>>>>>>>>>>>    schema creation.
>>>>>>>>>>>>
>>>>>>>>>>>> These are rough conclusions from a first pass. I plan to take a
>>>>>>>>>>>> deeper look at the end-to-end flow and implementation details to
>>>>>>>>>>>> ensure the approach is correct and aligns well with Iceberg’s design.
>>>>>>>>>>>>
>>>>>>>>>>>> ASF’s GSoC 2026 ideas list mentions two common project sizes:
>>>>>>>>>>>> 175 hours (medium) and 350 hours (large). From my understanding,
>>>>>>>>>>>> this idea could reasonably fit into the 175-hour category.
>>>>>>>>>>>>
>>>>>>>>>>>> I’d really appreciate your advice on what would be best:
>>>>>>>>>>>>
>>>>>>>>>>>>    - Whether it makes sense to propose this Spark defaults
>>>>>>>>>>>>    work as a GSoC idea and get it vetted on the dev mailing list, or
>>>>>>>>>>>>    - Whether you’d recommend proposing a different idea for
>>>>>>>>>>>>    GSoC and doing this particular work independently before the
>>>>>>>>>>>>    coding period.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks a lot for taking the time. I really appreciate your
>>>>>>>>>>>> guidance.
>>>>>>>>>>>> Looking forward to your response.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Feb 9, 2026 at 10:54 PM Russell Spitzer <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think we are always interested, but we tend to be stretched
>>>>>>>>>>>>> thin on reviewing resources at the moment. If you are interested,
>>>>>>>>>>>>> I would try to find something that folks are already very
>>>>>>>>>>>>> interested in and that has some reviewers pre-selected.
>>>>>>>>>>>>>
>>>>>>>>>>>>> At the moment we are very focused on finishing up the V4 spec,
>>>>>>>>>>>>> which is a pretty huge undertaking but probably isn't good for a
>>>>>>>>>>>>> first project. If you have time, I think one rather contained
>>>>>>>>>>>>> project could be making Spark Defaults work when creating an Iceberg
>>>>>>>>>>>>> table or using an Alter table statement. Currently we just error out
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Feb 9, 2026 at 11:12 AM Varun Lakhyani <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello Russell,
>>>>>>>>>>>>>> I am Varun Lakhyani, a final-year undergraduate student at
>>>>>>>>>>>>>> IIT Roorkee, India.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I've been actively studying and contributing to Apache
>>>>>>>>>>>>>> Iceberg for some time. So far, I have five merged PRs
>>>>>>>>>>>>>> <https://github.com/apache/iceberg/commits/main/?author=varun-lakhyani>,
>>>>>>>>>>>>>> one of which you reviewed, and one open PR involving a core
>>>>>>>>>>>>>> API module change
>>>>>>>>>>>>>> <https://github.com/apache/iceberg/pull/15252>. I have also
>>>>>>>>>>>>>> started a discussion on the Iceberg dev mailing list
>>>>>>>>>>>>>> <https://lists.apache.org/thread/nmt8glsctsqrshx7fxc0ljtxp8h8jh6p>
>>>>>>>>>>>>>> related to this open PR to get broader review and feedback.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am interested in participating in Google Summer of Code
>>>>>>>>>>>>>> 2026 under the Apache Software Foundation, working with Apache
>>>>>>>>>>>>>> Iceberg. I noticed Iceberg isn't currently listed in the GSoC
>>>>>>>>>>>>>> ideas list
>>>>>>>>>>>>>> <https://cwiki.apache.org/confluence/display/COMDEV/GSoC+2026+Ideas+list>.
>>>>>>>>>>>>>> ASF documentation
>>>>>>>>>>>>>> <https://community.apache.org/gsoc/#students-read-this> mentions
>>>>>>>>>>>>>> that contributors can propose new ideas for existing Apache
>>>>>>>>>>>>>> projects, provided those ideas are vetted on the project’s dev
>>>>>>>>>>>>>> mailing list.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Given your experience with ASF projects and Apache Iceberg, I
>>>>>>>>>>>>>> wanted to seek your guidance on this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    - Whether Iceberg would generally be open to GSoC
>>>>>>>>>>>>>>    participation if there is a well-scoped and project-aligned
>>>>>>>>>>>>>>    idea.
>>>>>>>>>>>>>>    - Whether there are particular areas in Iceberg where a
>>>>>>>>>>>>>>    GSoC-sized project could realistically make sense and be
>>>>>>>>>>>>>>    useful to the community.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I’d really appreciate any direction or suggestions you may
>>>>>>>>>>>>>> have.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> Lakhyani Varun
>>>>>>>>>>>>>> Indian Institute of Technology Roorkee
>>>>>>>>>>>>>> Github <https://github.com/varun-lakhyani> | LinkedIn
>>>>>>>>>>>>>> <https://www.linkedin.com/in/varun-lakhyani-154a35250/> |
>>>>>>>>>>>>>> Codeforces <https://codeforces.com/profile/progskipper> |
>>>>>>>>>>>>>> Codechef <https://www.codechef.com/users/v_k_18>
>>>>>>>>>>>>>> Contact: +91 96246 46174
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
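
Note on the async reader idea discussed in the thread: the suggestion to open some number of small data files in parallel and then stitch the results together for downstream processing can be illustrated roughly as below. This is only a minimal sketch in plain Java and is not taken from the Iceberg codebase or from the PR linked above. The class name ParallelSmallFileReader, the method readAll, and the use of a Callable as a stand-in for "open and read one small file" are all assumptions made for illustration. It also buffers every row in memory, whereas a real reader would likely bound the buffer and expose an iterator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, for illustration only; not an Iceberg class.
final class ParallelSmallFileReader {
  private ParallelSmallFileReader() {}

  /**
   * Runs each "file task" on a fixed-size pool and concatenates the rows
   * in the original task order. Each Callable is an assumed stand-in for
   * opening and fully reading one small data file.
   */
  static <T> List<T> readAll(List<? extends Callable<List<T>>> fileTasks, int parallelism)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(parallelism);
    try {
      // Submit everything up front so up to `parallelism` files are open at once.
      List<Future<List<T>>> pending = new ArrayList<>();
      for (Callable<List<T>> task : fileTasks) {
        pending.add(pool.submit(task));
      }

      // Stitch results together in submission order, blocking on each future
      // in turn, so the output order matches the sequential reader's order.
      List<T> rows = new ArrayList<>();
      for (Future<List<T>> future : pending) {
        rows.addAll(future.get());
      }
      return rows;
    } finally {
      pool.shutdownNow();
    }
  }
}
```

Waiting on the futures in submission order keeps the output identical to what a one-file-at-a-time reader would produce, which is one possible design choice; a reader that does not need ordering could instead consume whichever file finishes first.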
