Congratulations Zheng Hu!
On Tue, Jun 29, 2021 at 8:17 PM OpenInx wrote:
> Thanks all !
>
> I really appreciate the trust from the Apache iceberg community. For me,
> this is not only an honor, but also a responsibility. I'd like to share
> something about the current apache iceberg status in
Congrats Russell!
Sent from my iPhone
> On Mar 29, 2021, at 9:41 AM, Dilip Biswal wrote:
>
>
> Congratulations Russell!! Very well deserved, indeed!!
>
>> On Mon, Mar 29, 2021 at 9:13 AM Miao Wang wrote:
>> Congratulations Russell!
>>
>>
>>
>> Miao
>>
>>
>>
>> From: Szehon Ho
>>
+ dawilcox
On Tue, Jan 26, 2021 at 11:46 AM Gautam wrote:
> Hey Ryan & David,
> I believe this change from you [1] indirectly achieves this.
> David's issue is that every table.load() is instantiating one FS handle for
> each snapshot, and in your change, by con
ing to
> helpful articles, slide decks, etc. about Iceberg. In-the-trenches
> information is often the most useful.
>
> On Fri, Jan 15, 2021 at 3:43 PM Ryan Blue
> wrote:
>
>> Thanks, Gautam! I was just reading the one on query optimizations. Great
>> that you are writ
"Iceberg at Adobe" *[2]*
and "High Throughput Ingestion with Iceberg" *[3]*.
Hoping these are helpful to others..
thanks and regards,
-Gautam.
[1] -
https://medium.com/adobetech/taking-query-optimizations-to-the-next-level-with-iceberg-6c968b83cd6f
[2] - https://medium.com/adobetech
raised.
Regards,
-Gautam.
On Wed, Sep 9, 2020 at 5:07 PM Ryan Blue wrote:
> Hi everyone, I'm putting this on the agenda for today's Iceberg sync.
>
> Also, I want to point out John's recent PR that added a way to inject a
> Clock that is used for timestamp generation:
> htt
this today or
are they not exposing such a feature at all due to the inherent distributed
timing problem? Would like to hear how others are thinking/going about
this. Thoughts?
Cheers,
-Gautam.
Congratulations Shardul!
On Thu, Jul 23, 2020 at 12:24 AM Shardul Mahadik
wrote:
> Thanks everyone!!
>
> Best,
> Shardul
>
> On 2020/07/23 06:52:57, "Driesprong, Fokko" wrote:
> > Congrats Shardul! Great work!
> >
> > Cheers, Fokko
> >
> > Op do 23 jul. 2020 om 07:46 schreef Miao Wang >:
> >
*Followed the steps:*
1. Downloaded the source tarball, signature (.asc), and checksum (.sha512)
from
https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.9.0-rc5/
2. Downloaded https://dist.apache.org/repos/dist/dev/incubator/iceberg/KEYS
Import gpg keys: download KEYS and run gpg
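The gpg command itself is cut off above; the usual verification sequence looks roughly like the sketch below. The checksum step is demonstrated on a locally created stand-in file, since the real tarball name and URL are release-specific:

```shell
# 2. Import the signing keys (KEYS file downloaded from the dist repo):
#      gpg --import KEYS
# 3. Verify the detached signature against the real tarball:
#      gpg --verify apache-iceberg-0.9.0.tar.gz.asc apache-iceberg-0.9.0.tar.gz
# 4. Verify the sha512 checksum. A stand-in file is used here so the snippet
#    runs without the actual release artifact:
printf 'stand-in for the release tarball\n' > apache-iceberg-0.9.0.tar.gz
sha512sum apache-iceberg-0.9.0.tar.gz > apache-iceberg-0.9.0.tar.gz.sha512
sha512sum -c apache-iceberg-0.9.0.tar.gz.sha512
```

On a real release candidate you would of course download the `.sha512` file from dist.apache.org rather than generating it locally.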
+1 We've come a long way :-)
On Wed, May 13, 2020 at 1:07 AM Dongjoon Hyun
wrote:
> +1 for graduation!
>
> Bests,
> Dongjoon.
>
> On Tue, May 12, 2020 at 11:59 PM Driesprong, Fokko
> wrote:
>
>> +1
>>
>> Op wo 13 mei 2020 om 08:58 schreef jiantao yu
>>
>>> +1 for graduation.
>>>
>>>
>>> On
My 2 cents :
> * Merge manifest_entry and data_file?
... -1 .. keeping the difference between v1 and v2 metadata to a
minimum would be my preference, by keeping manifest_entries the same way in
both v1 and v2. People using either flow would want to modify and
contribute and shouldn't
Ran checks on
https://dist.apache.org/repos/dist/dev/incubator/iceberg/apache-iceberg-0.8.0-incubating-rc2/
√ RAT checks passed
√ signature is correct
√ checksum is correct
√ build from source (with java 8)
√ run tests locally
+1 (non-binding)
On Thu, Apr 30, 2020 at 4:18 PM Samarth Jain
can start taking
those up too.
thanks for the good work,
- Gautam.
On Mon, Mar 30, 2020 at 8:39 AM Junjie Chen
wrote:
> +1 to create the branch. Some row-level delete subtasks must be based on
> the sequence number as well as end to end tests.
>
> On Fri, Mar 27, 2020 at 4:4
5 / 5:30pm any day of next week works for me.
On Thu, Mar 19, 2020 at 6:07 PM 李响 wrote:
> 5 or 5:30 PM (UTC-7, is it PDT now) in any day works for me.
> Looking forward to it 8-)
>
> On Fri, Mar 20, 2020 at 8:17 AM RD wrote:
>
>> Same time works for me too!
>>
>> On Thu, Mar 19, 2020 at 4:45
+1 for monthly/fortnightly and 5pm PST
What day are we thinking for next meeting?
On Wed, Mar 18, 2020 at 1:30 PM RD wrote:
> +1
>
> On Wed, Mar 18, 2020 at 10:49 AM Ryan Blue
> wrote:
>
>> No problem, we can alternate times to include everyone. How about the
>> next sync at 5 PM UTC+7 and
Congratulations and thanks for your work.
On Sun, Feb 16, 2020 at 8:37 PM RD wrote:
> Thanks everyone!
>
> -Best,
> R.
>
> On Sun, Feb 16, 2020 at 7:39 PM David Christle
> wrote:
>
>> Congrats!!!
>>
>>
>>
>> *From: *Jacques Nadeau
>> *Reply-To: *"dev@iceberg.apache.org"
>> *Date: *Sunday,
CustomTableOperations's *doCommit* implementation.
Thanks for the guidance,
-Gautam.
On Tue, Jan 28, 2020 at 2:55 PM Ryan Blue wrote:
> Thanks for pointing out those references, suds!
>
> And thanks to Mouli (for writing the doc) and Anton (for writing the test)!
>
> On Tue, Ja
handling write/read consistency cases where the underlying fs
doesn't provide atomic apis for file overwrite/rename? We've outlined the
details in the attached issue#758 [1] .. What do folks think?
Cheers,
-Gautam.
[1] - https://github.com/apache/incubator-iceberg/issues/758
[2] - https
A feature flag sounds good to me with associated regression tests to pair
along with each feature.
Re: Snapshot Id Inheritance, would be good to update the spec with the
change in metadata guarantees.
-Gautam.
On Mon, Jan 13, 2020 at 11:28 AM Ryan Blue
wrote:
> Hi everyone,
>
>
*Vectorization notes (Nov 14) *
Attendees:
- Anjali
- Samarth
- Ryan
- Gautam
Overall things covered:
- Current state of performance
- How to start getting things from vectorized-read branch into master
- Next steps for complex types
Current performance:
- Reads
Great first release milestone! Looking forward to more work going into this
community! Thanks to Ryan for shepherding the release and those who helped
verify it.
On Mon, Oct 28, 2019 at 10:48 PM Mouli Mukherjee
wrote:
> Awesome! Congratulations!
>
> On Mon, Oct 28, 2019 at 9:17 AM Sandeep Sagar
elect the
> Apache releases repository.
> >>>
> >>> I don't think this is a problem with the release. The convenience
> binaries in the release must be signed and published from an Apache
> repository, so this is necessary. If you're trying to use the releas
I was able to run steps in Ryan's mail just fine but ran into the same
thing Arina mentioned .. when running "* ./gradlew build publish *" ..
A problem was found with the configuration of task
':iceberg-api:signApachePublication'.
> No value has been specified for property 'signatory.keyId'.
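For reference, that missing `signatory.keyId` comes from the Gradle signing plugin, which reads its signatory from project properties. A typical `gradle.properties` entry (values below are placeholders, not real credentials) looks like:

```properties
# Gradle signing plugin configuration (placeholder values)
signing.keyId=0ABC123D
signing.password=<key passphrase>
signing.secretKeyRingFile=/home/me/.gnupg/secring.gpg
```

Without these properties set, `signApachePublication` has no key to sign with and fails as shown above.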
Hello Devs,
We met to discuss progress and next steps on Vectorized
read path in Iceberg. Here are my notes from the sync. Feel free to reply
with clarifications in case I mis-quoted or missed anything.
*Attendees*:
Anjali Norwood
Padma Pennumarthy
Ryan Blue
Samarth Jain
Gautam
+1 9 am PST on Tues/Wednesday works.
On Mon, Oct 7, 2019 at 4:50 AM Jacques Nadeau wrote:
> Tuesdays work best for me.
>
> On Sun, Oct 6, 2019, 4:18 PM Anton Okolnychyi
> wrote:
>
>> Tuesday/Wednesday/Thursday works fine for me. Anything up to 19:00 UTC /
>> 20:00 BST / 12:00 PDT is OK if
above with sample
data : https://gist.github.com/prodeezy/b2cc35b87fca7d43ae681d45b3d7cab3
Cheers,
-Gautam.
On Wed, Sep 25, 2019 at 5:29 AM Ryan Blue wrote:
> Hi Shone,
>
> Iceberg should be able to handle out of order data columns in nested
> structures. We probably just n
ful and we can keep this as an interim solution
behind a feature flag, I can get a PR up with proper unit tests.
thanks and regards,
-Gautam.
[1] - https://github.com/apache/incubator-iceberg/issues/9
[2] - https://github.com/apache/incubator-iceberg/tree/vectorized-read
[3] -
https://github.com/apa
Way to go Anton! Appreciate all the work and guidance.
On Tue, Sep 3, 2019 at 9:33 AM John Zhuge wrote:
> Congratulations Anton!
>
> On Mon, Sep 2, 2019 at 8:45 PM Mouli Mukherjee
> wrote:
>
>> Congratulations Anton!
>>
>> On Mon, Sep 2, 2019, 8:38 PM Saisai Shao wrote:
>>
>>> Congrats Anton!
Super! That’d be great. Lemme know if I can help in any way.
Sent from my iPhone
> On Aug 30, 2019, at 6:30 PM, Anton Okolnychyi
> wrote:
>
> Hi Gautam,
>
> Iceberg does support nested schema pruning but Spark doesn’t request this for
> DS V2 in 2.4. Internally, we ha
>> are other people in the community that are interested, like Palantir. If
>> there isn't anything sensitive then let's try to be more inclusive. Thanks!
>>
>> rb
>>
>>> On Wed, Aug 7, 2019 at 10:34 PM Anjali Norwood wrote:
>>> Hi Gautam, Padma,
IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergVect10k  ss  5  0.275 ± 0.040  s/op
IcebergSourceFlatParquetDataFilterBenchmark.readWithFilterIcebergVect5k   ss  5  0.273 ± 0.031  s/op
On Wed, Jul 31, 2019 at 2:35 PM Anjali Norwood
wrote:
> Hi Gau
Also I think the other thing that's fundamentally different is the way Page
iteration and Column iteration are done in Iceberg vs. the way value
reading happens in Spark's ValuesReader implementations.
On Wed, Jul 31, 2019 at 1:44 PM Gautam wrote:
> Hey Samarth,
> Sorr
*does this.
I'll try to provide more insights once I improve my code. But if there are
other insights folks have on where we can improve things, I'd gladly try
them.
Cheers,
- Gautam.
[0] - https://github.com/prodeezy/incubator-iceberg/tree/vectorized-read
[1] -
https://github.com/prodeezy
hub.com/apache/incubator-iceberg/blob/master/build.gradle#L167
>
> We'll need to fix the build to disable for the jmh tasks.
>
> -Dan
>
> On Fri, Jul 26, 2019 at 3:34 PM Daniel Weeks wrote:
>
>> Gautam, you need to have the jmh-core libraries available to run. I
This fails on master too btw. Just wondering if I'm doing something wrong
trying to run this.
On Fri, Jul 26, 2019 at 2:24 PM Gautam wrote:
> I've been trying to run the jmh benchmarks bundled within the project. I've
> been running into issues with that .. have others hit this? Am I r
PM Ryan Blue wrote:
> Thanks Gautam!
>
> We'll start taking a look at your code. What do you think about creating a
> branch in the Iceberg repository where we can work on improving it
> together, before merging it into master?
>
> Also, you mentioned performance comparison
+1 on having a branch. Lemme know once you do, I'll rebase and open a PR
against it.
Will get back to you on perf numbers soon.
On Wed, Jul 24, 2019 at 2:03 PM Ryan Blue wrote:
> Thanks Gautam!
>
> We'll start taking a look at your code. What do you think about creating a
> branch in
not used. This was from my previous impl of Vectorization. I've kept it
around to compare performance.
Lemme know what folks think of the approach. I'm getting this working for
our scale test benchmark and will report back with numbers. Feel free to
run your own benchmarks and share.
Cheers,
-Gautam
Will do. Doing a bit of housekeeping on the code and also adding more
primitive type support.
On Mon, Jul 22, 2019 at 1:41 PM Matt Cheah wrote:
> Would it be possible to put the work in progress code in open source?
>
>
>
> *From: *Gautam
> *Reply-To: *"dev@icebe
That would be great!
On Mon, Jul 22, 2019 at 9:12 AM Daniel Weeks wrote:
> Hey Gautam,
>
> We also have a couple people looking into vectorized reading (into Arrow
> memory). I think it would be good for us to get together and see if we can
> collaborate on a common approach for
, 2019 at 5:22 PM Gautam wrote:
> Hey Guys,
>Sorry bout the delay on this. Just got back on getting a basic
> working implementation in Iceberg for Vectorization on primitive types.
>
> *Here's what I have so far : *
>
> I have added `ParquetValueReader` implement
emove the projection by reporting
the iterator's schema back to Spark*". Is there a simple way to
communicate that to Spark for my new iterator? Any pointers on how to get
around this?
Thanks and Regards,
-Gautam.
On Fri, Jun 14, 2019 at 4:22 PM Ryan Blue wrote:
> Replies inline.
>
> O
one.
On Fri, Jun 14, 2019 at 4:22 PM Ryan Blue wrote:
> Replies inline.
>
> On Fri, Jun 14, 2019 at 1:11 AM Gautam wrote:
>
>> Thanks for responding Ryan,
>>
>> Couple of follow up questions on ParquetValueReader for Arrow..
>>
>> I'd like to start with
<*ColumnarBatch*> *so that
DataSourceV2ScanExec starts using ColumnarBatch scans
That's a lot of questions! :-) but hope I'm making sense.
-Gautam.
[1] -
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/arrow/ArrowWriter.scala
On Thu,
.
On Thu, Jun 13, 2019 at 10:56 PM Anton Okolnychyi
wrote:
> Gautam, could you also share the code for benchmarks and conversion?
>
> Thanks,
> Anton
>
> On 13 Jun 2019, at 19:38, Ryan Blue wrote:
>
> Sounds like a good start. I think the nex
ppreciate your guidance,
-Gautam.
On Fri, May 24, 2019 at 5:28 PM Ryan Blue wrote:
> if Iceberg Reader was to wrap Arrow or ColumnarBatch behind an
> Iterator[InternalRow] interface, it would still not work right? Coz it
> seems to me there is a lot more going on upstream in the operato
. Then we can add complexity from
> there.
>
> On Fri, May 24, 2019 at 4:28 PM Gautam wrote:
>
>> Hello devs,
>>As a follow up to
>> https://github.com/apache/incubator-iceberg/issues/9 I'v been reading
>> through how Spark does vectorized reading in
needed between V2 DataSourceReader (like
Iceberg) and the operator execution?
thank you,
-Gautam.
[1] -
https://github.com/apache/spark/blob/branch-2.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L412
[2] -
https://github.com/apache/spark
ant datasource patches to Spark 2.3 a non starter? If
> this were doable I believe this is much simpler than bypassing Iceberg
> metadata to read files directly.
>
> -R
>
> On Wed, May 15, 2019 at 3:02 PM Gautam wrote:
>
>> Just wanted to add, from what I have tested so
:42 PM Anton Okolnychyi
wrote:
> Hey Gautam,
>
> Out of my curiosity, did you manage to confirm the root cause of the issue?
>
> P.S. I created [1] so that we can make collection of lower/upper bounds
> configurable.
>
> Thanks,
> Anton
>
> [1] - https://github.co
> The length in bytes of the schema is 109M as compared to 687K of the
non-stats dataset.
Typo: length in bytes of *manifest*; schema is the same.
On Fri, Apr 19, 2019 at 12:16 PM Gautam wrote:
> Correction, partition count = 4308.
>
> > Re: Changing the way we keep stats.
larger context, 109M is not that much
metadata given that Iceberg is meant for datasets where the metadata itself
is big-data scale. I'm curious how folks with larger-sized metadata (in
GB) are optimizing this today.
Cheers,
-Gautam.
On Fri, Apr 19, 2019 at 12:40 AM Ryan Blue
wrote:
> Tha
dly for parallelization.
thanks.
On Fri, Apr 19, 2019 at 12:12 PM Gautam wrote:
> Ah, my bad. I missed adding in the schema details .. Here are some details
> on the dataset with stats :
>
> Iceberg Schema Columns : 20
> Spark Schema fields : 20
> Snapshot Summary :{added-d
ption here that can be leveraged. Would appreciate some guidance. If
there isn't a straightforward fix and others feel this is an issue I can
raise an issue and look into it further.
Cheers,
-Gautam.
Raised https://github.com/apache/incubator-iceberg/issues/122 for the
filtering support.
On Wed, Mar 6, 2019 at 1:34 AM Anton Okolnychyi
wrote:
> Sounds good, Gautam.
>
> Our intention was to be able to filter out files using predicates on
> nested fields. For now, file skippin
+1
Sent from my iPhone
> On Mar 6, 2019, at 6:56 AM, RD wrote:
>
> +1
>
>> On Tue, Mar 5, 2019 at 5:01 PM John Zhuge wrote:
>> +1
>>
>>> On Tue, Mar 5, 2019 at 4:59 PM Xabriel Collazo Mojica
>>> wrote:
>>> +1
>>>
>>>
>>>
>>> Xabriel J Collazo Mojica | Senior Software Engineer |
will add the
struct metrics, I could open a separate Iceberg issue about the struct
expression handling. If Ryan and you agree on allowing struct based
filtering in Iceberg as long as we avoid mixed filtering (map>
, array> , etc.) I can go ahead and work on it.
Cheers,
-Gautam.
On Tue,
I think you've solved
the problem and correctly built your table metadata using the metrics from
the Parquet footers, but I still want to note the distinction: Avro
manifests store metrics correctly. Avro data files don't generate metrics.
Gotcha!
Cheers,
-Gautam.
[1] - https://github.com/apache/in
+1
Sent from my iPhone
> On Feb 28, 2019, at 10:09 PM, Daniel Weeks wrote:
>
> +1 (binding)
>
> On 2019/02/27 21:11:01, Ryan Blue wrote:
> > This is a follow-up to the discussion thread, where we seem to have>
> > consensus around the proposal to allow committers to commit their own pull>
/gist.github.com/prodeezy/001cf155ff0675be7d307e9f842e1dac
Cheers,
-Gautam.
[1] - https://github.com/apache/spark/pull/22573
On Tue, Feb 26, 2019 at 10:35 PM Anton Okolnychyi
wrote:
> Unfortunately, Spark doesn’t push down filters for nested columns. I
> remember an effort to implement it [1]. However, it
.. Just to be clear, my concern is around Iceberg not skipping files.
Iceberg does skip rowGroups when scanning files as
*iceberg.parquet.ParquetReader* uses the parquet stats under it while
skipping, albeit none of these stats come from the manifests.
On Tue, Feb 26, 2019 at 7:24 PM Gautam wrote
post scan filters anyways.
Let me know what you think,
Cheers,
-Gautam.
[1] -
https://github.com/apache/incubator-iceberg/blob/master/parquet/src/main/java/com/netflix/iceberg/parquet/ParquetReader.java#L103-L112
[2] -
https://github.com/apache/incubator-iceberg/blob/master/parquet/src/mai
n filtering not returning
results in the Iceberg case. I think post scan filtering is unable to
handle Iceberg format. So if 1) is not the way forward then the alternative
way is to fix this in the post scan filtering.
Looking forward to your guidance on the way forward.
Cheers,
-Gautam.
[1] -
htt