Re: Welcoming OpenInx as a new PMC member!

2021-06-30 Thread Gautam
Congratulations Zheng Hu! On Tue, Jun 29, 2021 at 8:17 PM OpenInx wrote: > Thanks all ! > > I really appreciate the trust from the Apache iceberg community. For me, > this is not only an honor, but also a responsibility. I'd like to share > something about the current apache iceberg status in

Re: Welcoming Russell Spitzer as a new committer

2021-03-29 Thread Gautam Kowshik
Congrats Russell! Sent from my iPhone > On Mar 29, 2021, at 9:41 AM, Dilip Biswal wrote: > > > Congratulations Russell!! Very well deserved, indeed!! > >> On Mon, Mar 29, 2021 at 9:13 AM Miao Wang wrote: >> Congratulations Russell! >> >> >> >> Miao >> >> >> >> From: Szehon Ho >>

Re: Ways To Alleviate Load For Tables With Many Snapshots

2021-01-26 Thread Gautam
+ dawilcox On Tue, Jan 26, 2021 at 11:46 AM Gautam wrote: > Hey Ryan & David, > I believe this change from you [1] indirectly achieves this. > David's issue is that every table.load() is instantiating one FS handle for > each snapshot, and in your change, by con

Re: Adobe Blog ..

2021-01-15 Thread Gautam
ing to > helpful articles, slide decks, etc about Iceberg. In the trenches > information is often the most useful. > > On Fri, Jan 15, 2021 at 3:43 PM Ryan Blue > wrote: > >> Thanks, Gautam! I was just reading the one on query optimizations. Great >> that you are writ

Adobe Blog ..

2021-01-15 Thread Gautam
ceberg at Adobe" *[2]* and "High Throughput Ingestion with Iceberg" *[3]*. Hoping these are helpful to others.. thanks and regards, -Gautam. [1] - https://medium.com/adobetech/taking-query-optimizations-to-the-next-level-with-iceberg-6c968b83cd6f [2] - https://medium.com/adobetech

Re: Timestamp Based Incremental Reading in Iceberg ...

2020-09-10 Thread Gautam
raised. Regards, -Gautam. On Wed, Sep 9, 2020 at 5:07 PM Ryan Blue wrote: > Hi everyone, I'm putting this on the agenda for today's Iceberg sync. > > Also, I want to point out John's recent PR that added a way to inject a > Clock that is used for timestamp generation: > htt

Timestamp Based Incremental Reading in Iceberg ...

2020-09-08 Thread Gautam
this today or are they not exposing such a feature at all due to the inherent distributed timing problem? Would like to hear how others are thinking/going about this. Thoughts? Cheers, -Gautam.

Re: New committer: Shardul Mahadik

2020-07-23 Thread Gautam
Congratulations Shardul! On Thu, Jul 23, 2020 at 12:24 AM Shardul Mahadik wrote: > Thanks everyone!! > > Best, > Shardul > > On 2020/07/23 06:52:57, "Driesprong, Fokko" wrote: > > Congrats Shardul! Great work! > > > > Cheers, Fokko > > > > On Thu, Jul 23, 2020 at 07:46, Miao Wang wrote: > >

Re: [VOTE] Release Apache Iceberg 0.9.0 RC5

2020-07-13 Thread Gautam
*Followed the steps:* 1. Downloaded the source tarball, signature (.asc), and checksum (.sha512) from https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.9.0-rc5/ 2. Downloaded https://dist.apache.org/repos/dist/dev/incubator/iceberg/KEYS Import gpg keys: download KEYS and run gpg
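The checksum part of these steps could be scripted roughly as follows. This is only a minimal sketch, assuming the .sha512 file holds the hex digest as its first whitespace-separated token (for example "<hex digest>  <file name>"); the helper names `sha512_of` and `checksum_matches` are illustrative and not part of the release instructions. Signature verification with gpg is a separate step and is not covered here.

```python
import hashlib
from pathlib import Path


def sha512_of(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large tarballs are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(tarball, checksum_file):
    """Compare the tarball's digest with the first token of the checksum file.

    Assumption: the checksum file looks like "<hex digest>  <file name>".
    """
    expected = Path(checksum_file).read_text().split()[0].lower()
    return sha512_of(tarball) == expected
```

Calling `checksum_matches` with the paths of the downloaded tarball and its .sha512 file returns True only when the digests agree.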

Re: [VOTE] Graduate to a top-level project

2020-05-14 Thread Gautam
+1 We've come a long way :-) On Wed, May 13, 2020 at 1:07 AM Dongjoon Hyun wrote: > +1 for graduation! > > Bests, > Dongjoon. > > On Tue, May 12, 2020 at 11:59 PM Driesprong, Fokko > wrote: > >> +1 >> >> On Wed, May 13, 2020 at 08:58, jiantao yu wrote: >> >>> +1 for graduation. >>> >>> >>> 在

Re: [DISCUSS] Changes for row-level deletes

2020-05-06 Thread Gautam
My 2 cents : > * Merge manifest_entry and data_file? ... -1 .. keeping the difference between v1 and v2 metadata to a minimum would be my preference by keeping manifest_entries the same way in both v1 and v2. People using either flows would want to modify and contribute and shouldn't

Re: [VOTE] Release Apache Iceberg 0.8.0-incubating RC2

2020-05-01 Thread Gautam
Ran checks on https://dist.apache.org/repos/dist/dev/incubator/iceberg/apache-iceberg-0.8.0-incubating-rc2/ √ RAT checks passed √ signature is correct √ checksum is correct √ build from source (with java 8) √ run tests locally +1 (non-binding) On Thu, Apr 30, 2020 at 4:18 PM Samarth Jain

Re: Open a new branch for row-delete feature ?

2020-03-30 Thread Gautam
can start taking those up too. thanks for the good work, - Gautam. On Mon, Mar 30, 2020 at 8:39 AM Junjie Chen wrote: > +1 to create the branch. Some row-level delete subtasks must be based on > the sequence number as well as end to end tests. > > On Fri, Mar 27, 2020 at 4:4

Re: Shall we start a regular community sync up?

2020-03-19 Thread Gautam
5 / 5:30pm any day of next week works for me. On Thu, Mar 19, 2020 at 6:07 PM 李响 wrote: > 5 or 5:30 PM (UTC-7, is it PDT now) in any day works for me. > Looking forward to it 8-) > > On Fri, Mar 20, 2020 at 8:17 AM RD wrote: > >> Same time works for me too! >> >> On Thu, Mar 19, 2020 at 4:45

Re: Shall we start a regular community sync up?

2020-03-18 Thread Gautam
+1 for monthly/fortnightly and 5pm PST. What day are we thinking for the next meeting? On Wed, Mar 18, 2020 at 1:30 PM RD wrote: > +1 > > On Wed, Mar 18, 2020 at 10:49 AM Ryan Blue > wrote: > >> No problem, we can alternate times to include everyone. How about the >> next sync at 5 PM UTC+7 and

Re: Welcome new committer and PPMC member Ratandeep Ratti

2020-02-17 Thread Gautam
Congratulations and thanks for your work. On Sun, Feb 16, 2020 at 8:37 PM RD wrote: > Thanks everyone! > > -Best, > R. > > On Sun, Feb 16, 2020 at 7:39 PM David Christle > wrote: > >> Congrats!!! >> >> >> >> *From: *Jacques Nadeau >> *Reply-To: *"dev@iceberg.apache.org" >> *Date: *Sunday,

Re: Write reliability in Iceberg

2020-01-28 Thread Gautam
CustomTableOperations's *doCommit* implementation. Thanks for the guidance, -Gautam. On Tue, Jan 28, 2020 at 2:55 PM Ryan Blue wrote: > Thanks for pointing out those references, suds! > > And thanks to Mouli (for writing the doc) and Anton (for writing the test)! > > On Tue, Ja

Write reliability in Iceberg

2020-01-28 Thread Gautam
handling write/read consistency cases where the underlying fs doesn't provide atomic APIs for file overwrite/rename? We've outlined the details in the attached issue #758 [1]. What do folks think? Cheers, -Gautam. [1] - https://github.com/apache/incubator-iceberg/issues/758 [2] - https

Re: [DISCUSS] Forward compatibility and snapshot ID inheritance

2020-01-13 Thread Gautam
A feature flag sounds good to me with associated regression tests to pair along with each feature. Re: Snapshot Id Inheritance, would be good to update the spec with the change in metadata guarantees. -Gautam. On Mon, Jan 13, 2020 at 11:28 AM Ryan Blue wrote: > Hi everyone, > >

Iceberg Vectorized Reads Meeting Notes (Nov 14)

2019-11-14 Thread Gautam
*Vectorization notes (Nov 14) * Attendees: - Anjali - Samarth - Ryan - Gautam Overall things covered: - Current state of performance - How to start getting things from vectorized-read branch into master - Next steps for complex types Current performance: - Reads

Re: [ANNOUNCE] Apache Iceberg release 0.7.0-incubating

2019-10-31 Thread Gautam
Great first release milestone! Looking forward to more work going into this community! Thanks to Ryan for shepherding the release and those who helped verify it. On Mon, Oct 28, 2019 at 10:48 PM Mouli Mukherjee wrote: > Awesome! Congratulations! > > On Mon, Oct 28, 2019 at 9:17 AM Sandeep Sagar

Re: [VOTE] Release Apache Iceberg 0.7.0-incubating RC1

2019-10-14 Thread Gautam
elect the > Apache releases repository. > >>> > >>> I don't think this is a problem with the release. The convenience > binaries in the release must be signed and published from an Apache > repository, so this is necessary. If you're trying to use the releas

Re: [VOTE] Release Apache Iceberg 0.7.0-incubating RC1

2019-10-13 Thread Gautam
I was able to run the steps in Ryan's mail just fine but ran into the same thing Arina mentioned .. when running "* ./gradlew build publish *" .. A problem was found with the configuration of task ':iceberg-api:signApachePublication'. > No value has been specified for property 'signatory.keyId'.

Iceberg Vectorized Reads Meeting Notes (Oct 7)

2019-10-07 Thread Gautam
Hello Devs, We met to discuss progress and next steps on Vectorized read path in Iceberg. Here are my notes from the sync. Feel free to reply with clarifications in case I mis-quoted or missed anything. *Attendees*: Anjali Norwood Padma Pennumarthy Ryan Blue Samarth Jain Gautam

Re: [DISCUSS] Iceberg community sync?

2019-10-07 Thread Gautam
+1 9 am PST on Tues/Wednesday works. On Mon, Oct 7, 2019 at 4:50 AM Jacques Nadeau wrote: > Tuesdays work best for me. > > On Sun, Oct 6, 2019, 4:18 PM Anton Okolnychyi > wrote: > >> Tuesday/Wednesday/Thursday works fine for me. Anything up to 19:00 UTC / >> 20:00 BST / 12:00 PDT is OK if

Re: Incompatible Writes due to OutOfOrder Fields

2019-09-26 Thread Gautam
above with sample data : https://gist.github.com/prodeezy/b2cc35b87fca7d43ae681d45b3d7cab3 Cheers, -Gautam. On Wed, Sep 25, 2019 at 5:29 AM Ryan Blue wrote: > Hi Shone, > > Iceberg should be able to handle out of order data columns in nested > structures. We probably just n

Iceberg using V1 Vectorized Reader over Parquet ..

2019-09-04 Thread Gautam
ful and we can keep this as an interim solution behind a feature flag, I can get a PR up with proper unit tests. thanks and regards, -Gautam. [1] - https://github.com/apache/incubator-iceberg/issues/9 [2] - https://github.com/apache/incubator-iceberg/tree/vectorized-read [3] - https://github.com/apa

Re: New committer and PPMC member, Anton Okolnychyi

2019-09-02 Thread Gautam
Way to go Anton! Appreciate all the work and guidance. On Tue, Sep 3, 2019 at 9:33 AM John Zhuge wrote: > Congratulations Anton! > > On Mon, Sep 2, 2019 at 8:45 PM Mouli Mukherjee > wrote: > >> Congratulations Anton! >> >> On Mon, Sep 2, 2019, 8:38 PM Saisai Shao wrote: >> >>> Congrats Anton!

Re: Nested Column Pruning in Iceberg (DSV2) ..

2019-08-30 Thread Gautam Kowshik
Super! That’d be great. Lemme know if I can help in any way. Sent from my iPhone > On Aug 30, 2019, at 6:30 PM, Anton Okolnychyi > wrote: > > Hi Gautam, > > Iceberg does support nested schema pruning but Spark doesn’t request this for > DS V2 in 2.4. Internally, we ha

Re: Encouraging performance results for Vectorized Iceberg code

2019-08-08 Thread Gautam Kowshik
>> are other people in the community that are interested, like Palantir. If >> there isn't anything sensitive then let's try to be more inclusive. Thanks! >> >> rb >> >>> On Wed, Aug 7, 2019 at 10:34 PM Anjali Norwood wrote: >>> Hi Gautam, Padma, >

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-31 Thread Gautam
IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergVect10k ss5 0.275 ± 0.040 s/op IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergVect5k ss5 0.273 ± 0.031 s/op On Wed, Jul 31, 2019 at 2:35 PM Anjali Norwood wrote: > Hi Gau

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-31 Thread Gautam
Also I think the other thing that's fundamentally different is the way Page iteration and Column iteration are done in Iceberg vs. the way value reading happens in Spark's ValuesReader implementations. On Wed, Jul 31, 2019 at 1:44 PM Gautam wrote: > Hey Samarth, > Sorr

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-31 Thread Gautam
*does this. I'll try and provide more insights once I improve my code. But if there are other insights folks have on where we can improve things, I'd gladly try them. Cheers, - Gautam. [0] - https://github.com/prodeezy/incubator-iceberg/tree/vectorized-read [1] - https://github.com/prodeezy

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-26 Thread Gautam
hub.com/apache/incubator-iceberg/blob/master/build.gradle#L167 > > We'll need to fix the build to disable for the jmh tasks. > > -Dan > > On Fri, Jul 26, 2019 at 3:34 PM Daniel Weeks wrote: > >> Gautam, you need to have the jmh-core libraries available to run. I >

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-26 Thread Gautam
This fails on master too btw. Just wondering if I'm doing something wrong trying to run this. On Fri, Jul 26, 2019 at 2:24 PM Gautam wrote: > I've been trying to run the jmh benchmarks bundled within the project. I've > been running into issues with that .. have others hit this? Am I r

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-26 Thread Gautam
PM Ryan Blue wrote: > Thanks Gautam! > > We'll start taking a look at your code. What do you think about creating a > branch in the Iceberg repository where we can work on improving it > together, before merging it into master? > > Also, you mentioned performance comparison

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-24 Thread Gautam
+1 on having a branch. Lemme know once you do and I'll rebase and open a PR against it. Will get back to you on perf numbers soon. On Wed, Jul 24, 2019 at 2:03 PM Ryan Blue wrote: > Thanks Gautam! > > We'll start taking a look at your code. What do you think about creating a > branch in

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-23 Thread Gautam
not used. This was from my previous impl of Vectorization. I've kept it around to compare performance. Lemme know what folks think of the approach. I'm getting this working for our scale test benchmark and will report back with numbers. Feel free to run your own benchmarks and share. Cheers, -Gautam

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-22 Thread Gautam
Will do. Doing a bit of housekeeping on the code and also adding more primitive type support. On Mon, Jul 22, 2019 at 1:41 PM Matt Cheah wrote: > Would it be possible to put the work in progress code in open source? > > > > *From: *Gautam > *Reply-To: *"dev@icebe

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-22 Thread Gautam
That would be great! On Mon, Jul 22, 2019 at 9:12 AM Daniel Weeks wrote: > Hey Gautam, > > We also have a couple people looking into vectorized reading (into Arrow > memory). I think it would be good for us to get together and see if we can > collaborate on a common approach for

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-21 Thread Gautam
, 2019 at 5:22 PM Gautam wrote: > Hey Guys, > Sorry about the delay on this. Just got back on getting a basic > working implementation in Iceberg for Vectorization on primitive types. > > *Here's what I have so far: * > > I have added `ParquetValueReader` implement

Re: Approaching Vectorized Reading in Iceberg ..

2019-07-19 Thread Gautam
emove the projection by reporting the iterator's schema back to Spark*". Is there a simple way to communicate that to Spark for my new iterator? Any pointers on how to get around this? Thanks and Regards, -Gautam. On Fri, Jun 14, 2019 at 4:22 PM Ryan Blue wrote: > Replies inline. > > O

Re: Approaching Vectorized Reading in Iceberg ..

2019-06-14 Thread Gautam
one. On Fri, Jun 14, 2019 at 4:22 PM Ryan Blue wrote: > Replies inline. > > On Fri, Jun 14, 2019 at 1:11 AM Gautam wrote: > >> Thanks for responding Ryan, >> >> Couple of follow up questions on ParquetValueReader for Arrow.. >> >> I'd like to start with

Re: Approaching Vectorized Reading in Iceberg ..

2019-06-14 Thread Gautam
<*ColumnarBatch*> *so that DataSourceV2ScanExec starts using ColumnarBatch scans That's a lot of questions! :-) but hope i'm making sense. -Gautam. [1] - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala On Thu,

Re: Approaching Vectorized Reading in Iceberg ..

2019-06-14 Thread Gautam
. On Thu, Jun 13, 2019 at 10:56 PM Anton Okolnychyi wrote: > Gautam, could you also share the code for benchmarks and conversion? > > Thanks, > Anton > > On 13 Jun 2019, at 19:38, Ryan Blue wrote: > > Sounds like a good start. I think the nex

Re: Approaching Vectorized Reading in Iceberg ..

2019-06-13 Thread Gautam
ppreciate your guidance, -Gautam. On Fri, May 24, 2019 at 5:28 PM Ryan Blue wrote: > if Iceberg Reader was to wrap Arrow or ColumnarBatch behind an > Iterator[InternalRow] interface, it would still not work right? Coz it > seems to me there is a lot more going on upstream in the operato

Re: Approaching Vectorized Reading in Iceberg ..

2019-05-24 Thread Gautam
. Then we can add complexity from > there. > > On Fri, May 24, 2019 at 4:28 PM Gautam wrote: > >> Hello devs, >> As a follow up to >> https://github.com/apache/incubator-iceberg/issues/9 I've been reading >> through how Spark does vectorized reading in

Approaching Vectorized Reading in Iceberg ..

2019-05-24 Thread Gautam
needed between V2 DataSourceReader (like Iceberg) and the operator execution? thank you, -Gautam. [1] - https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L412 [2] - https://github.com/apache/spark

Re: Vanilla Spark Readers on Iceberg written data..

2019-05-15 Thread Gautam
ant datasource patches to Spark 2.3 a non starter? If > this were doable I believe this is much simpler than bypassing Iceberg > metadata to read files directly. > > -R > > On Wed, May 15, 2019 at 3:02 PM Gautam wrote: > >> Just wanted to add, from what I have tested so

Re: Reading dataset with stats making lots of network traffic..

2019-05-02 Thread Gautam
:42 PM Anton Okolnychyi wrote: > Hey Gautam, > > Out of my curiosity, did you manage to confirm the root cause of the issue? > > P.S. I created [1] so that we can make collection of lower/upper bounds > configurable. > > Thanks, > Anton > > [1] - https://github.co

Re: Reading dataset with stats making lots of network traffic..

2019-04-19 Thread Gautam
> The length in bytes of the schema is 109M as compared to 687K of the non-stats dataset. Typo, length in bytes of *manifest*. schema is the same. On Fri, Apr 19, 2019 at 12:16 PM Gautam wrote: > Correction, partition count = 4308. > > > Re: Changing the way we keep stats.

Re: Reading dataset with stats making lots of network traffic..

2019-04-19 Thread Gautam
larger context, 109M is not that much metadata given that Iceberg is meant for datasets where the metadata itself is at big-data scale. I'm curious about how folks with larger metadata (in GB) are optimizing this today. Cheers, -Gautam. On Fri, Apr 19, 2019 at 12:40 AM Ryan Blue wrote: > Tha

Re: Reading dataset with stats making lots of network traffic..

2019-04-19 Thread Gautam
dly for parallelization. thanks. On Fri, Apr 19, 2019 at 12:12 PM Gautam wrote: > Ah, my bad. I missed adding in the schema details .. Here are some details > on the dataset with stats : > > Iceberg Schema Columns : 20 > Spark Schema fields : 20 > Snapshot Summary :{added-d

Reading dataset with stats making lots of network traffic..

2019-04-18 Thread Gautam
ption here that can be leveraged. Would appreciate some guidance. If there isn't a straightforward fix and others feel this is an issue I can raise an issue and look into it further. Cheers, -Gautam.

Re: Iceberg scans not keeping or using important file/column statistics in manifests ..

2019-03-06 Thread Gautam
Raised https://github.com/apache/incubator-iceberg/issues/122 for the filtering support. On Wed, Mar 6, 2019 at 1:34 AM Anton Okolnychyi wrote: > Sounds good, Gautam. > > Our intention was to be able to filter out files using predicates on > nested fields. For now, file skippin

Re: [VOTE] Add the python implementation

2019-03-05 Thread Gautam Kowshik
+1 Sent from my iPhone > On Mar 6, 2019, at 6:56 AM, RD wrote: > > +1 > >> On Tue, Mar 5, 2019 at 5:01 PM John Zhuge wrote: >> +1 >> >>> On Tue, Mar 5, 2019 at 4:59 PM Xabriel Collazo Mojica >>> wrote: >>> +1 >>> >>> >>> >>> Xabriel J Collazo Mojica | Senior Software Engineer |

Re: Iceberg scans not keeping or using important file/column statistics in manifests ..

2019-03-05 Thread Gautam
will add the struct metrics, I could open a separate Iceberg issue about the struct expression handling. If Ryan and you agree on allowing struct based filtering in Iceberg as long as we avoid mixed filtering (map> , array> , etc.) I can go ahead and work on it. Cheers, -Gautam. On Tue,

Re: Iceberg scans not keeping or using important file/column statistics in manifests ..

2019-03-05 Thread Gautam
I think you've solved the problem and correctly built your table metadata using the metrics from the Parquet footers, but I still want to note the distinction: Avro manifests store metrics correctly. Avro data files don't generate metrics. Gotcha! Cheers, -Gautam. [1] - https://github.com/apache/in

Re: [VOTE] Community code reviews

2019-02-28 Thread Gautam Kowshik
+1 Sent from my iPhone > On Feb 28, 2019, at 10:09 PM, Daniel Weeks wrote: > > +1 (binding) > > On 2019/02/27 21:11:01, Ryan Blue wrote: > > This is a follow-up to the discussion thread, where we seem to have> > > consensus around the proposal to allow committers to commit their own pull>

Re: Iceberg scans not keeping or using important file/column statistics in manifests ..

2019-02-28 Thread Gautam
/gist.github.com/prodeezy/001cf155ff0675be7d307e9f842e1dac Cheers, -Gautam. [1] - https://github.com/apache/spark/pull/22573 On Tue, Feb 26, 2019 at 10:35 PM Anton Okolnychyi wrote: > Unfortunately, Spark doesn’t push down filters for nested columns. I > remember an effort to implement it [1]. However, it

Re: Iceberg scans not keeping or using important file/column statistics in manifests ..

2019-02-26 Thread Gautam
.. Just to be clear my concern is around Iceberg not skipping files. Iceberg does skip rowGroups when scanning files as *iceberg.parquet.ParquetReader* uses the parquet stats under it while skipping, albeit none of these stats come from the manifests. On Tue, Feb 26, 2019 at 7:24 PM Gautam wrote

Re: Iceberg fails to return results when filtered on complex columns ..

2019-02-21 Thread Gautam
post scan filters anyways. Let me know what you think, Cheers, -Gautam. [1] - https://github.com/apache/incubator-iceberg/blob/master/parquet/src/main/java/com/netflix/iceberg/parquet/ParquetReader.java#L103-L112 [2] - https://github.com/apache/incubator-iceberg/blob/master/parquet/src/mai

Iceberg fails to return results when filtered on complex columns ..

2019-02-18 Thread Gautam
n filtering not returning results in the Iceberg case. I think post scan filtering is unable to handle Iceberg format. So if 1) is not the way forward then the alternative way is to fix this in the post scan filtering. Looking forward to your guidance on the way forward. Cheers, -Gautam. [1] - htt