Re: [ANNOUNCE] New committer: Honah J.

2024-01-14 Thread OpenInx
Congrats, Honah!

On Sun, Jan 14, 2024 at 1:25 AM Jun H.  wrote:

> Congratulations!
>
> On Jan 12, 2024, at 10:12 PM, Péter Váry 
> wrote:
>
> 
> Congratulations!
>
> On Sat, Jan 13, 2024, 06:26 Jean-Baptiste Onofré  wrote:
>
>> Congrats !
>>
>> Regards
>> JB
>>
>> Le ven. 12 janv. 2024 à 22:11, Fokko Driesprong  a
>> écrit :
>>
>>> On behalf of the Iceberg PMC, I'm happy to announce that Honah has
>>> accepted an invitation to become a committer on Apache (Py)Iceberg.
>>> Welcome, and thank you for your contributions!
>>>
>>> Kind regards,
>>> Fokko
>>>
>>


Re: Spark cannot read iceberg tables which were originally written by Impala

2024-01-03 Thread OpenInx
Hi Zoltán

Thanks for filing the issue. I think it's fair to wait for a new major
release for this breaking change.

Best Regards.

On Wed, Jan 3, 2024 at 11:16 PM Zoltán Borók-Nagy
 wrote:

> Hi,
>
> I created IMPALA-12675
> <https://issues.apache.org/jira/browse/IMPALA-12675> about annotating
> STRINGs with UTF8 by default. The code change should be trivial, but I'm
> afraid we will need to wait for a new major release with this (because
> users might store binary data in STRING columns, so it would be a breaking
> change for them). Until then users can set PARQUET_ANNOTATE_STRINGS_UTF8
> for themselves.
>
> Approach C: Yeah, if Approach A goes through then we don't really need to
> bother with this.
>
> Cheers,
> Zoltan
>
>
> On Wed, Jan 3, 2024 at 2:02 PM OpenInx  wrote:
>
>> Thanks Zoltan and Ryan for your feedback.
>>
>> I think we all agreed on adding an option to promote BINARY to String
>> (Approach A) on the flink/spark/hive reader side, so that historic
>> datasets written by impala to hive can be read correctly.  Besides that,
>> applying approach B to future Apache Impala releases also sounds
>> reasonable to me; I think we can also create a PR in the apache impala
>> repo at the same time as applying approach A to the iceberg repo.
>>
>> About approach C, I guess those parquet files would still need to be fully
>> rewritten even though we are only trying to change the file metadata, which
>> may be costly. So I'm a bit hesitant to choose this approach.
>>
>> Jiajie and I will try to create two PRs for the two things (A and B): one
>> for the apache iceberg repo and another for the apache impala repo.
>>
>> Best regards.
>>
>> On Tue, Jan 2, 2024 at 2:49 AM Ryan Blue  wrote:
>>
>> > Thanks for bringing this up and for finding the cause.
>> >
>> > I think we should add an option to promote binary to string (Approach
>> A).
>> > That sounds pretty reasonable overall. I think it would be great if
>> Impala
>> > also produced correct Parquet files, but that's beyond our control and
>> > there's, no doubt, a ton of data already in that format.
>> >
>> > This could also be part of our v3 work, where I think we intend to add
>> > binary to string type promotion to the format.
>> >
>> > On Tue, Dec 26, 2023 at 2:38 PM Zoltán Borók-Nagy <
>> borokna...@apache.org>
>> > wrote:
>> >
>> >> Hey Everyone,
>> >>
>> >> Thank you for raising this issue and reaching out to the Impala
>> community.
>> >>
>> >> Let me clarify that the problem only happens when there is a legacy
>> >> Hive table written by Impala, which is then converted to Iceberg. When
>> >> Impala writes into an Iceberg table there is no problem with
>> >> interoperability.
>> >>
>> >> The root cause is that Impala has only recently added support for the
>> >> BINARY type, and the STRING type could serve as a workaround to store
>> >> binary data. This is why Impala does not add the UTF8 annotation for
>> >> STRING columns in legacy Hive tables. (Again, for Iceberg tables Impala
>> >> adds the UTF8 annotation.)
>> >>
>> >> Later, when the table is converted to Iceberg, the migration process
>> >> does not rewrite the datafiles. Neither Spark nor Impala's own ALTER
>> >> TABLE CONVERT TO statement does.
>> >>
>> >> My comments on the proposed solutions, plus another one (Approach C):
>> >>
>> >> Approach A (promote BINARY to UTF8 during reads): I think it makes
>> >> sense. The Parquet metadata also stores information about the writer,
>> >> so if we want this to be a very specific fix, we can check if the
>> >> writer was indeed Impala.
>> >>
>> >> Approach B (Impala should annotate STRING columns with UTF8): This
>> >> probably can only be fixed in a new major version of Impala. Impala
>> >> supports the BINARY type now, so I think it makes sense to limit the
>> >> STRING type to actual string data. This approach does not fix already
>> >> written files, as you already pointed out.
>> >>
>> >> Approach C: The migration job could copy data files but rewrite file
>> >> metadata, if needed. This makes migration slower, but it's probably
>> >> still faster than a CREATE TABLE AS SELECT.
>> >>
>> >> On the Impala side, we surely need to update our docs about migration
>> >> and interoperability.

Re: Spark cannot read iceberg tables which were originally written by Impala

2024-01-03 Thread OpenInx
Thanks Zoltan and Ryan for your feedback.

I think we all agreed on adding an option to promote BINARY to String
(Approach A) on the flink/spark/hive reader side, so that historic
datasets written by impala to hive can be read correctly.  Besides that,
applying approach B to future Apache Impala releases also sounds reasonable
to me; I think we can also create a PR in the apache impala repo at the
same time as applying approach A to the iceberg repo.
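
As a side note on Zoltán's suggestion below to "check if the writer was
indeed Impala": Parquet records the writing engine in the footer's
created_by field, so the gate can be a plain footer inspection. A minimal,
hypothetical sketch with parquet-mr (the class name and the substring
heuristic are illustrative, not the actual fix):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class ImpalaWriterCheck {
  // Returns true when the Parquet footer's created_by string mentions Impala,
  // i.e. a BINARY column without a UTF8 annotation may really hold string data.
  public static boolean writtenByImpala(String path) throws Exception {
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path(path), new Configuration()))) {
      String createdBy = reader.getFooter().getFileMetaData().getCreatedBy();
      return createdBy != null && createdBy.toLowerCase().contains("impala");
    }
  }
}
```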

About approach C, I guess those parquet files would still need to be fully
rewritten even though we are only trying to change the file metadata, which
may be costly. So I'm a bit hesitant to choose this approach.

Jiajie and I will try to create two PRs for the two things (A and B): one
for the apache iceberg repo and another for the apache impala repo.

Best regards.

On Tue, Jan 2, 2024 at 2:49 AM Ryan Blue  wrote:

> Thanks for bringing this up and for finding the cause.
>
> I think we should add an option to promote binary to string (Approach A).
> That sounds pretty reasonable overall. I think it would be great if Impala
> also produced correct Parquet files, but that's beyond our control and
> there's, no doubt, a ton of data already in that format.
>
> This could also be part of our v3 work, where I think we intend to add
> binary to string type promotion to the format.
>
> On Tue, Dec 26, 2023 at 2:38 PM Zoltán Borók-Nagy 
> wrote:
>
>> Hey Everyone,
>>
>> Thank you for raising this issue and reaching out to the Impala community.
>>
>> Let me clarify that the problem only happens when there is a legacy Hive
>> table written by Impala, which is then converted to Iceberg. When Impala
>> writes into an Iceberg table there is no problem with interoperability.
>>
>> The root cause is that Impala has only recently added support for the BINARY
>> type, and the STRING type could serve as a workaround to store binary data. This is
>> why Impala does not add the UTF8 annotation for STRING columns in legacy
>> Hive tables. (Again, for Iceberg tables Impala adds the UTF8 annotation.)
>>
>> Later, when the table is converted to Iceberg, the migration process does
>> not rewrite the datafiles. Neither Spark nor Impala's own ALTER TABLE
>> CONVERT TO statement does.
>>
>> My comments on the proposed solutions, plus another one (Approach C):
>>
>> Approach A (promote BINARY to UTF8 during reads): I think it makes sense.
>> The Parquet metadata also stores information about the writer, so if we
>> want this to be a very specific fix, we can check if the writer was indeed
>> Impala.
>>
>> Approach B (Impala should annotate STRING columns with UTF8): This
>> probably can only be fixed in a new major version of Impala. Impala
>> supports the BINARY type now, so I think it makes sense to limit the STRING
>> type to actual string data. This approach does not fix already written
>> files, as you already pointed out.
>>
>> Approach C: The migration job could copy data files but rewrite file
>> metadata, if needed. This makes migration slower, but it's probably still
>> faster than a CREATE TABLE AS SELECT.
>>
>> On the Impala side, we surely need to update our docs about migration and
>> interoperability.
>>
>> Cheers,
>>Zoltan
>>
>> OpenInx wrote (on Tue, Dec 26, 2023 at 7:40):
>>
>>> Hi dev
>>>
>>> Sensordata [1] encountered an interesting Apache Impala & Iceberg bug
>>> in a real customer production environment.
>>> Their customers used Apache Impala to create a large number of Apache
>>> Hive tables in HMS, and ingested a PB-scale dataset into those hive
>>> tables (which were originally written by Apache Impala). Recently,
>>> their customers migrated those Hive tables to Apache Iceberg tables,
>>> but failed to query the huge dataset in iceberg table format using
>>> Apache Spark.
>>>
>>> Jiajie Feng (from Sensordata) and I wrote a simple demo to demonstrate
>>> this issue; for more details please see below:
>>>
>>> https://docs.google.com/document/d/1uXgj7GGp59K_hnV3gKWOsI2ljFTKcKBP1hb_Ux_HXuY/edit?usp=sharing
>>>
>>> We'd like to hear feedback and suggestions from both the impala and
>>> iceberg communities. Both Jiajie and I would like to fix this issue
>>> once we have an aligned solution.
>>>
>>> Best Regards.
>>>
>>> 1. https://www.sensorsdata.com/en/
>>>
>>
>
> --
> Ryan Blue
> Tabular
>


Spark cannot read iceberg tables which were originally written by Impala

2023-12-25 Thread OpenInx
Hi dev

Sensordata [1] encountered an interesting Apache Impala & Iceberg bug
in a real customer production environment.
Their customers used Apache Impala to create a large number of Apache Hive
tables in HMS, and ingested a PB-scale dataset into those hive tables
(which were originally written by Apache Impala). Recently, their
customers migrated those Hive tables to Apache Iceberg tables, but failed
to query the huge dataset in iceberg table format using Apache Spark.

Jiajie Feng (from Sensordata) and I wrote a simple demo to demonstrate
this issue; for more details please see below:
https://docs.google.com/document/d/1uXgj7GGp59K_hnV3gKWOsI2ljFTKcKBP1hb_Ux_HXuY/edit?usp=sharing

We'd like to hear feedback and suggestions from both the impala and
iceberg communities. Both Jiajie and I would like to fix this issue
once we have an aligned solution.

Best Regards.

1. https://www.sensorsdata.com/en/


Re: RFC: Control flink upsert sink’s memory usage of insertedRowMap

2023-12-10 Thread OpenInx
https://github.com/apache/iceberg/pull/2680/files

On Mon, Dec 11, 2023 at 11:15 AM OpenInx  wrote:

> Just to provide a little context: there was a stale PR that tried to
> maintain the insertedRowMap in RocksDB.
>
> On Sat, Dec 9, 2023 at 1:52 AM Ryan Blue  wrote:
>
>> Thanks, Renjie!
>>
>> The option to use Flink's state tracking system seems like a good idea to
>> me.
>>
>> On Thu, Dec 7, 2023 at 8:19 PM Renjie Liu 
>> wrote:
>>
>>> Hi:
>>> I want to raise a discussion about controlling flink's upsert sink's
>>> memory usage:
>>>
>>> https://toys-flash-4hl.craft.me/3VHrdWbV30QMk6
>>>
>>> Welcome to comment and share your thoughts.
>>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>


Re: RFC: Control flink upsert sink’s memory usage of insertedRowMap

2023-12-10 Thread OpenInx
Just to provide a little context: there was a stale PR that tried to
maintain the insertedRowMap in RocksDB.
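
For a concrete picture of what "use Flink's state tracking" could look
like, here is a hypothetical sketch assuming a keyed stream and
Flink-managed MapState; the class, state, and type names are placeholders,
not the actual sink's implementation:

```java
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class UpsertDedupFunction extends KeyedProcessFunction<String, String, String> {
  // Flink-managed map: with a RocksDB state backend this lives on disk and is
  // checkpointed, instead of growing an in-memory HashMap like insertedRowMap.
  private transient MapState<String, Boolean> insertedRowMap;

  @Override
  public void open(Configuration parameters) {
    insertedRowMap = getRuntimeContext().getMapState(
        new MapStateDescriptor<>("inserted-row-map", String.class, Boolean.class));
  }

  @Override
  public void processElement(String row, Context ctx, Collector<String> out)
      throws Exception {
    if (insertedRowMap.contains(ctx.getCurrentKey())) {
      // An earlier row with the same key was inserted; a real upsert sink
      // would emit a delete for it here before writing the new row.
    }
    insertedRowMap.put(ctx.getCurrentKey(), Boolean.TRUE);
    out.collect(row);
  }
}
```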

On Sat, Dec 9, 2023 at 1:52 AM Ryan Blue  wrote:

> Thanks, Renjie!
>
> The option to use Flink's state tracking system seems like a good idea to
> me.
>
> On Thu, Dec 7, 2023 at 8:19 PM Renjie Liu  wrote:
>
>> Hi:
>> I want to raise a discussion about controlling flink's upsert sink's
>> memory usage:
>>
>> https://toys-flash-4hl.craft.me/3VHrdWbV30QMk6
>>
>> Welcome to comment and share your thoughts.
>>
>
>
> --
> Ryan Blue
> Tabular
>


Re: Welcome new PMC members!

2023-04-11 Thread OpenInx
Congrats!

On Wed, Apr 12, 2023 at 10:25 AM Junjie Chen 
wrote:

> Congratulations to all of you!
>
> On Wed, Apr 12, 2023 at 10:07 AM Reo Lei  wrote:
>
>> Congratulations!!!
>>
>>> yuxia wrote on Wed, Apr 12, 2023 at 09:19:
>>
>>> Congratulations to all!
>>>
>>> Best regards,
>>> Yuxia
>>>
>>> --
>>> *From: *"Russell Spitzer" 
>>> *To: *"dev" 
>>> *Sent: *Wednesday, April 12, 2023, 6:13:01 AM
>>> *Subject: *Re: Welcome new PMC members!
>>>
>>> Great news, Congratulations to all!
>>>
>>> On Apr 11, 2023, at 5:11 PM, Dmitri Bourlatchkov <
>>> dmitri.bourlatch...@dremio.com.INVALID> wrote:
>>>
>>> Congratulations Fokko, Steven, and Yufei!
>>>
>>> On Tue, Apr 11, 2023 at 5:22 PM Ryan Blue  wrote:
>>>
 Hi everyone!

 I want to congratulate 3 new PMC members, Fokko Driesprong, Steven Wu,
 and Yufei Gu. Thanks for all your contributions!

 I was going to wait a little longer to announce, but since they're in
 our board report it's already out.

 Ryan

 --
 Ryan Blue
 Tabular

>>>
>>>
>>>
>
> --
> Best Regards
>


Re: In Remembrance of Kyle

2022-12-07 Thread OpenInx
So sad to get this news... I lost such a great, kind, passionate friend.

On Thu, Dec 8, 2022 at 1:36 AM Ryan Blue  wrote:

> I'm going to miss Kyle and I'm sad to lose him.
>
> He was amazing at making everyone feel welcome here. I think he commented
> on nearly every pull request for the last few years and he always showed
> people that this community values everyone's input and contributions.
> He's irreplaceable and I hope we continue to welcome contributors half as
> well now that he's gone.
>
> I'll also miss his sense of humor. We would occasionally play games
> together for team building at Tabular and he would constantly have his own
> fun. He never played to win, but loved suggesting the wrong answers to the
> other team in codewords or setting time limits far too low just to see the
> rest of us scramble in drawful. He was a lovely guy to know.
>
> Ryan
>
> On Tue, Dec 6, 2022 at 9:23 PM Rajarshi Sarkar 
> wrote:
>
>> I am extremely shocked and saddened to hear about Kyle's passing. I
>> remember how passionate and welcoming he was in our interactions over
>> PRs/Slack. Rest in peace, Kyle. You will be truly missed.
>>
>> Regards,
>> Rajarshi Sarkar
>>
>>
>> On Wed, Dec 7, 2022 at 8:28 AM Jahagirdar, Amogh
>>  wrote:
>>
>>> I’m deeply saddened by this. Kyle is someone I looked up to, and his
>>> passion for Iceberg and open source in general was truly amazing. He always
>>> was active in the community and I learned a lot from him in my discussions
>>> with him on PRs; it was an honor to work with Kyle. I will truly miss him.
>>> Rest in peace Kyle.
>>>
>>> My thoughts and best wishes go out to his family.
>>>
>>>
>>>
>>> -Amogh
>>>
>>>
>>>
>>> *From: *Sreeram Garlapati 
>>> *Reply-To: *"dev@iceberg.apache.org" 
>>> *Date: *Tuesday, December 6, 2022 at 1:38 PM
>>> *To: *"dev@iceberg.apache.org" 
>>> *Subject: *RE: [EXTERNAL]In Remembrance of Kyle
>>>
>>>
>>>
>>> *CAUTION*: This email originated from outside of the organization. Do
>>> not click links or open attachments unless you can confirm the sender and
>>> know the content is safe.
>>>
>>>
>>>
>>> Very very sad and shocking news. Am very very sorry to hear this.
>>>
>>>
>>>
>>> Kyle is a great guy that turned out to be a friend. I was able to lean
>>> on his passion for iceberg - be it code or slack questions. Kyle is one of
>>> the guys - as to what formulates the iceberg community for me & that will
>>> never be the same without him.
>>>
>>>
>>>
>>> My thoughts are with him & his loved ones.
>>>
>>>
>>>
>>> ~Sreeram
>>>
>>>
>>>
>>> On Tue, Dec 6, 2022 at 11:11 AM Jack Ye  wrote:
>>>
>>> Very shocked and sad about this...
>>>
>>>
>>>
>>> Kyle is a genuine friend I made through the Iceberg project. He
>>> commented almost all my PRs, provided insightful suggestions and always
>>> influences me with his passion in the community and the project. I still
>>> remember chatting with him on slack about him changing job, moving home,
>>> and we constantly shared different things happened in work and life with
>>> each other. The community will be different without him, but I believe his
>>> spirit and passion will be passed down.
>>>
>>>
>>>
>>> Rest in peace, Kyle.
>>>
>>>
>>>
>>> Best,
>>>
>>> Jack Ye
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Dec 6, 2022 at 10:17 AM Szehon Ho 
>>> wrote:
>>>
>>> Very shocked when I first heard this over the weekend.  Became more sad
>>> when I learned how long he was sick for, and so humbled that he chose to
>>> spend so much of his last days with us in the Iceberg community.
>>>
>>>
>>>
>>> I did not have a chance to work directly with him in Apple as I was on a
>>> different team.  But it is hard to not see Kyle's buoyant personality
>>> over-flowing everywhere in the Iceberg mailing lists.  He is undoubtedly
>>> the most positive person I've ever met in open source.  He brought a
>>> much-needed human side to what is usually an impersonal open source
>>> community mostly conducted through emails, you could tell by each email how
>>> genuinely passionate he was about building the community, something quite
>>> rare.  As some have said, it is something we can all learn from, and the
>>> Iceberg community will hopefully not lose.
>>>
>>>
>>>
>>> May he rest in peace, and best wishes to his family.
>>>
>>>
>>>
>>> Szehon
>>>
>>>
>>>
>>> On Mon, Dec 5, 2022 at 10:05 PM Manish Malhotra <
>>> manish.malhotra.w...@gmail.com> wrote:
>>>
>>> Very sad and depressing news!!
>>>
>>> Difficult to believe for such a smiling and jubilant person.
>>>
>>>
>>>
>>> Have not worked directly with him, but have followed some of his threads
>>> in Iceberg and Trino channels.
>>>
>>> I have also interacted with him briefly while he was at Apple, and I know
>>> that he made big contributions to Iceberg and other OSS projects.
>>>
>>>
>>>
>>> Praying for him to rest in peace!
>>>
>>>
>>>
>>> - Manish
>>>
>>>
>>>
>>> On Mon, Dec 5, 2022 at 1:12 PM Prashant Singh 
>>> wrote:
>>>
>>> Extremely sad to hear this news, 

Re: [VOTE] Release Apache Iceberg 1.1.0 RC4

2022-11-27 Thread OpenInx
+1 (binding)

1. Download the source tarball, signature (.asc), and checksum (.sha512):
OK
2. Import gpg keys: download KEYS and run gpg --import
/path/to/downloaded/KEYS.txt (optional if this hasn’t changed) :  OK
3. Verify the signature by running: gpg --verify
apache-iceberg-1.1.0.tar.gz.asc :  OK
4. Verify the checksum by running: shasum -a 512
apache-iceberg-1.1.0.tar.gz  :  OK
5. Untar the archive and go into the source directory: tar xzvf
apache-iceberg-1.1.0.tar.gz && cd apache-iceberg-1.1.0:  OK
6. Run RAT checks to validate license headers: dev/check-license: OK
7. Build and test the project: ./gradlew build (use Java 8) :   OK

Thanks.

On Mon, Nov 28, 2022 at 3:05 AM Daniel Weeks  wrote:

> +1 (binding)
>
> Verified sigs/sums/licenses/build/test
> Built and tested with JDK 8
>
> -Dan
>
> On Fri, Nov 25, 2022 at 5:58 PM leilei hu  wrote:
>
>> +1(non-binding)
>> verified(java 8):
>>
>> - Create table using HiveCatalog and HadoopCatalog
>> - Spark Structured Streaming with Spark 3.2.1
>> - Spark query with Spark’s DataSourceV2 API
>> - Ran build with JDK8
>>
>> On Nov 24, 2022, at 12:13 AM, Cheng Pan wrote:
>>
>> +1 (non-binding)
>>
>> Passed integration test[1] w/ Apache Kyuubi
>>
>> [1] https://github.com/apache/incubator-kyuubi/pull/3810
>>
>> Thanks,
>> Cheng Pan
>>
>>
>> On Nov 23, 2022 at 16:13:34, Ajantha Bhat  wrote:
>>
>>> +1 (non-binding)
>>>
>>> - verified tests against spark-3.3 runtime jar with Nessie catalog.
>>> - verified the contents of the iceberg-spark-runtime-3.3_2.12-1.1.0.jar
>>> - checked for spark-3.0 removal
>>> - validated checksum and signature
>>> - checked license docs & ran RAT checks
>>> - ran build with JDK1.8
>>>
>>> Thanks,
>>> Ajantha
>>>
>>> On Tue, Nov 22, 2022 at 9:49 PM Gabor Kaszab 
>>> wrote:
>>>
 Hi Everyone,

 I propose that we release the following RC as the official Apache Iceberg 
 1.1.0 release.

 The commit ID is ede085d0f7529f24acd0c81dd0a43f7bb969b763
 * This corresponds to the tag: apache-iceberg-1.1.0-rc4
 * https://github.com/apache/iceberg/commits/apache-iceberg-1.1.0-rc4
 * https://github.com/apache/iceberg/tree/ede085d0f7529f24acd0c81dd0a43f7bb969b763

 The release tarball, signature, and checksums are here:
 * https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-1.1.0-rc4

 You can find the KEYS file here:
 * https://dist.apache.org/repos/dist/dev/iceberg/KEYS

 Convenience binary artifacts are staged on Nexus. The Maven repository URL 
 is:
 * https://repository.apache.org/content/repositories/orgapacheiceberg-1114/

 Please download, verify, and test.

 Please vote in the next 72 hours.

 [ ] +1 Release this as Apache Iceberg 1.1.0
 [ ] +0
 [ ] -1 Do not release this because...


>>


Re: [Discuss]- Donate Iceberg Flink Connector

2022-11-08 Thread OpenInx
Hi

Sorry for the late reply. I was one of the core flink iceberg connector
maintainers in the early stages (flink 1.12, flink 1.13, flink 1.14). For
the later flink releases, my work shifted and I had fewer interactions
with apache flink+iceberg; thanks to Ryan, Steven, Kyle, and hililiwei for
the great and hard work moving the iceberg flink modules forward.

When I learned that the flink and iceberg communities were discussing
moving the flink connector to an external unified flink connector repo, I
read the mailing list and the past iceberg community sync notes carefully,
and also talked with flink folks offline. I think I have basically
understood the context and the pros & cons of moving the flink connector
out of the iceberg repo.

To be honest, the workflow around the flink external connectors repository
is not very mature and gets less attention than the flink/iceberg
projects, so this may not be the right time to move the flink module out
of the iceberg repository. As we discussed above, the flink API is
gradually becoming more stable (moving in a better direction at least),
and the flink community has promised to invest more bandwidth (as Jark Wu
said) to maintain the flink module. I'm wondering if we can keep the flink
connector in the iceberg repository for a while (say a quarter or two) to
see if the status quo improves. If the maintenance experience improves
within a quarter or half a year, then keeping the flink module is the
right choice; after all, this is beneficial to both iceberg and flink.
Otherwise, I think we should move the flink module out once the API
becomes stable enough (if we have to do the move, the flink externalized
connectors repo will be more mature by then, making it more appropriate to
move the flink module).

Thanks.

On Tue, Oct 25, 2022 at 2:03 PM Jark Wu  wrote:

> Hi Ryan,
>
> Thanks for your input.
>
> I think the Flink Connector API is relatively stable now, compared to the
> previous versions.
> We have verified the latest Iceberg connector with the upcoming 1.16
> release, and it works well.
> I think API stability is something for the future and we should have some
> workflow or mechanism
> to guarantee this from an external connector side.
>
> We will come up with a proposal about the API compatibility guarantee
> workflow/mechanism and
> a best practice + PoC for multi-version support. We are willing to join
> the Iceberg community to
> improve/refactor the connector and deliver a better experience of the
> connector for users.
>
> How about holding the voting a bit and waiting until we have a conclusion
> about the discussion?
>
> Best,
> Jark
>
> On Tue, 25 Oct 2022 at 03:55, Ryan Blue  wrote:
>
>> I don't think we want to talk about the Flink community accepting the
>> Iceberg connector just yet. The goal of Abid's exploration is to see
>> what it would look like as an external connector. We'd need to decide
>> in the Iceberg community if that's something that we'd want to do long
>> term. If it were me, I'd probably say wait until the connector APIs
>> are stable and there is a best practice for releasing.
>>
>> Ryan
>>
>> On Mon, Oct 24, 2022 at 11:16 AM Martijn Visser
>>  wrote:
>> >
>> > Hi all,
>> >
>> > There are many valid points raised in this discussion thread, but I
>> think we should not mix up different topics. From my perspective, there's
>> two things ongoing:
>> >
>> > 1. This thread is about the Flink community accepting the Iceberg
>> connector, with various maintainers from Iceberg volunteering to help with
>> the maintenance of the connector itself.
>> > 2. Also included in this thread are discussions about the
>> externalization of connectors from Flink. There have been recent
>> discussions on this [1] and there is engineering activity happening on that
>> topic and it is a big focus point for the next couple weeks/months. With
>> regards to seeing different opinions, I actually don't see those on the
>> mailing list because after the discussions, voting is passing.
>> >
>> > Best regards,
>> >
>> > Martijn
>> >
>> > [1]
>> https://cwiki.apache.org/confluence/display/FLINK/Externalized+Connector+development
>> >
>> > On Fri, Oct 21, 2022 at 3:01 AM Jark Wu  wrote:
>> >>
>> >> Hi Abid and all,
>> >>
>> >> I added the Iceberg dev community for a wider discussion.
>> >>
>> >> I agree with Yuxia and have the same concern as Steven Wu.
>> >>
>> >> There were long discussions around externalizing connectors and many
>> >> different opinions.
>> >> If I remember correctly[1][2], at last, we would like to externalize
>> >> ElasticSearch as an example,
>> >> and see how it works and what we can standardize (e.g., docs, releases,
>> >> versions, CI).
>> >> When everything works well, we can externalize other connectors.
>> >>
>> >> However, from what I see, currently, the externalized ElasticSearch
>> >> connector
>> >> is still at an early stage 

Re: [VOTE] Release Apache Iceberg 0.14.1 RC3

2022-09-05 Thread OpenInx
+1 (binding).


1. Download the source tarball, signature (.asc), and checksum (.sha512): OK

2. Import gpg keys: download KEYS and run  (optional if this hasn’t changed)

```bash
$ gpg --import /path/to/downloaded/KEYS
```

It's OK

3. Verify the signature by running:

```bash
$ gpg --verify apache-iceberg-0.14.1.tar.gz.asc.txt
gpg: no signed data
gpg: can't hash datafile: No data
```

*Question*: Should we publish the signature file with the correct name
apache-iceberg-0.14.1.tar.gz.asc (instead of appending the unknown .txt
suffix)?

I renamed the apache-iceberg-0.14.1.tar.gz.asc.txt to
apache-iceberg-0.14.1.tar.gz.asc, and the verification works fine.

```bash
$ gpg --verify apache-iceberg-0.14.1.tar.gz.asc
gpg: assuming signed data in  'apache-iceberg-0.14.1.tar.gz'
gpg: Signature made 日 9/ 4 05:51:05 2022 CST
gpg: using RSA key D21CFB9BDBC379681261C7A086781D4FA4B2E9B5
gpg: Good signature from "Ryan Blue (CODE SIGNING KEY) "
[unknown]
gpg: aka "Ryan Blue " [unknown]
gpg: WARNING: This key is not certified with a trusted signature!
gpg: There is no indication that the signature belongs to the owner.
Primary key fingerprint: C9A1 1D83 5A3C 1E5A 7F23 D247 FCB3 CBD9 D392 4CCD
Subkey fingerprint: D21C FB9B DBC3 7968 1261 C7A0 8678 1D4F A4B2 E9B5
```

4. Verify the checksum by running:

```bash
$ shasum -a 512 -c apache-iceberg-0.14.1.tar.gz.sha512.txt
apache-iceberg-0.14.1.tar.gz
apache-iceberg-0.14.1.tar.gz: OK
```

5. Untar the archive and go into the source directory:

```bash
$ tar xzf apache-iceberg-0.14.1.tar.gz && cd apache-iceberg-0.14.1
```
It's OK.

6. Run RAT checks to validate license headers: dev/check-license: OK
7. Build and test the project: ./gradlew build (use Zulu Java 8) : OK



On Mon, Sep 5, 2022 at 4:37 PM leilei hu  wrote:

> +1(non-binding)
> verified(java 8):
>
> 1. Create table using HiveCatalog and HadoopCatalog
> 2. Spark Structured Streaming
> 3. Spark query with Spark’s DataSourceV2 API
> 4. Checksum, build and test
>
> On Sep 5, 2022, at 1:46 AM, Daniel Weeks wrote:
>
> +1 (binding)
>
> Verified: sigs, sums, license, build and test (java 8)
>
> -Dan
>
> On Sat, Sep 3, 2022 at 6:17 PM Ryan Blue  wrote:
>
>> +1 (binding)
>>
>>- Checked signature and checksum
>>- Ran RAT checks
>>- Tested with Java 11
>>
>> Ryan
>>
>> On Sat, Sep 3, 2022 at 5:41 PM Ryan Blue  wrote:
>>
>>> Hi Everyone,
>>>
>>> I propose that we release the following RC as the official Apache
>>> Iceberg 0.14.1 release.
>>>
>>> The commit ID is 71d918e781eff70c2c2a21aea7289daad61c8afe
>>> * This corresponds to the tag: apache-iceberg-0.14.1-rc3
>>> * https://github.com/apache/iceberg/commits/apache-iceberg-0.14.1-rc3
>>> *
>>> https://github.com/apache/iceberg/tree/71d918e781eff70c2c2a21aea7289daad61c8afe
>>>
>>> The release tarball, signature, and checksums are here:
>>> *
>>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.14.1-rc3
>>>
>>> You can find the KEYS file here:
>>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>>
>>> Convenience binary artifacts are staged on Nexus. The Maven repository
>>> URL is:
>>> *
>>> https://repository.apache.org/content/repositories/orgapacheiceberg-1105/
>>>
>>> The 0.14.1 milestone tracks the bugs that are fixed in this release:
>>> * https://github.com/apache/iceberg/milestone/21?closed=1
>>>
>>> Notable fixes include:
>>> * #5683 - Core: Fix exception handling in BaseTaskWriter (Flink's double
>>> close problem)
>>> * #5437 - Core, AWS: Fix Kryo serialization failure for FileIO
>>> * #5681 - Parquet: Close zstd input stream early to avoid memory pressure
>>> * #5691 - Spark: Fix stats in rewrite metadata action after partitioning
>>> changes
>>>
>>> Please download, verify, and test.
>>>
>>> This vote will be open for the next 72 hours.
>>>
>>> [ ] +1 Release this as Apache Iceberg 0.14.1
>>> [ ] +0
>>> [ ] -1 Do not release this because...
>>>
>>> --
>>> Ryan Blue
>>>
>>
>>
>> --
>> Ryan Blue
>>
>
>


Re: 【Feature】Request support for c++ sdk

2022-06-13 Thread OpenInx
Thanks Kyle for sharing your context.

Recently, I also spent some time practicing my Rust skills.  Generally,
I'm +1 for adding a Rust SDK as the native-language support.


On Mon, Jun 13, 2022 at 12:51 PM Kyle Bendickson  wrote:

> Thanks for starting this discussion.
>
> I know I was the first to mention some of my concerns (which I still have
> and would apply to any new major change), but I also think that this is an
> avenue that should be explored.
>
> Specifically a native integration would have many benefits for read paths
> (in addition to others). I know that the Rust avro reader is
> significantly faster, as well as native columnar formats.
>
> So while I do have some concerns about making sure we have enough people
> to support this endeavor, I do want to say I think it's a really good idea.
> My apologies if I gave the impression otherwise.
>
> I would personally be interested in contributing to and reviewing for a
> native Rust library (or CPP, but I think Rust is a much more elegant
> language and I'd personally prefer to work in that as it's easier to work
> with across systems than C++ imo though I would defer to others on that).
>
> I would also be happy to offer my help and perspective in moving this
> forward if need be. But I did want to express my practical concerns so that
> we don't have an area of the codebase where there aren't enough people to
> help maintain it etc.
>
> But in general I think this is an exciting opportunity, and results have
> shown time and time again that native readers / writers are much more
> performant.
>
> +1 to using Rust as well (which is a language I know more of than C++
> these days - though both I'd have to brush off my skillset).
>
> Best, Kyle
>
> On Sun, Jun 12, 2022 at 8:20 PM OpenInx  wrote:
>
>> Hi Tao Wu.
>>
>> I think the apache iceberg community is well aligned on providing the
>> Iceberg SDK for native languages.  I am very happy to offer my perspective
>> and help if needed when you try to move this thing forward.
>>
>> On Mon, Jun 13, 2022 at 11:04 AM Wu Tao  wrote:
>>
>>> Hi, everyone, I'm Tao. I'm currently working on a commercial streaming
>>> system that is written in Rust.
>>>
>>> Actually, I'm planning to implement an Iceberg Rust SDK so that we can
>>> have better integration with the existing Iceberg ecosystem. Initially I
>>> found https://github.com/oliverdaff/iceberg-rs, but it appears the
>>> author hasn't been active lately. So I'm looking to see if the Iceberg
>>> community has any consensus on a Rust/C++ SDK (Rust is preferable), and if
>>> there is, we'd love to contribute. I believe as Iceberg increases its
>>> popularity, there will eventually be more systems that want such libraries.
>>> There could have even been some ongoing works without consulting with the
>>> community.
>>>
>>> Additionally, I think the initial Rust/C++ SDK can only support the
>>> reader sides of Iceberg. Because there have been plenty of JVM-based
>>> query engines out there taking charge of data maintenance. We don't have to
>>> rewrite every corner of Iceberg in Rust. That means less engineering work.
>>>
>>> On 2022/06/08 10:16:05 OpenInx wrote:
>>> > As a cloud-native table format standard for the big-data ecosystem,  I
>>> > believe supporting multiple languages is the correct direction so that
>>> > different languages can connect to the apache iceberg table format.
>>> >
>>> > But I can also get Kyle's point about lacking enough
>>> resources(developers
>>> > and reviewers ) to accomplish this goal.  In my mind,  Python, Golang,
>>> C++,
>>> > Rust , all of them can be regarded as the native language support.  we
>>> may
>>> > just need to support the Rust SDK and then all of the other languages
>>> can
>>> > just wrap the Rust SDK to access the table format.
>>> >
>>> > Anyway,  we will need to wait for the REST catalog finished before we
>>> > introduce another languages support , because we can not access the
>>> iceberg
>>> > table by invoking the JVM catalog interfaces.
>>> >
>>> > On Tue, Jun 7, 2022 at 4:41 AM Micah Kornfield 
>>> > wrote:
>>> >
>>> > > There’s also the question of how useful this would be in practice
>>> given
>>> > >> the complexity of using C++ (or Rust etc) within some of the major
>>> > >> frameworks.
>>> > >>
>>> > >
>>> > > One place th

Re: 【Feature】Request support for c++ sdk

2022-06-12 Thread OpenInx
Hi Tao Wu.

I think the apache iceberg community is well aligned on providing the
Iceberg SDK for native languages.  I am very happy to offer my perspective
and help if needed when you try to move this thing forward.

On Mon, Jun 13, 2022 at 11:04 AM Wu Tao  wrote:

> Hi, everyone, I'm Tao. I'm currently working on a commercial streaming
> system that is written in Rust.
>
> Actually, I'm planning to implement an Iceberg Rust SDK so that we can
> have better integration with the existing Iceberg ecosystem. Initially I
> found https://github.com/oliverdaff/iceberg-rs, but it appears the author
> hasn't been active lately. So I'm looking to see if the Iceberg community
> has any consensus on a Rust/C++ SDK (Rust is preferable), and if there is,
> we'd love to contribute. I believe as Iceberg increases its popularity,
> there will eventually be more systems that want such libraries. There could
> have even been some ongoing works without consulting with the community.
>
> Additionally, I think the initial Rust/C++ SDK can only support the
> reader sides of Iceberg. Because there have been plenty of JVM-based
> query engines out there taking charge of data maintenance. We don't have to
> rewrite every corner of Iceberg in Rust. That means less engineering work.
>
> On 2022/06/08 10:16:05 OpenInx wrote:
> > As a cloud-native table format standard for the big-data ecosystem,  I
> > believe supporting multiple languages is the correct direction so that
> > different languages can connect to the apache iceberg table format.
> >
> > But I can also get Kyle's point about lacking enough resources(developers
> > and reviewers ) to accomplish this goal.  In my mind,  Python, Golang,
> C++,
> > Rust , all of them can be regarded as the native language support.  we
> may
> > just need to support the Rust SDK and then all of the other languages can
> > just wrap the Rust SDK to access the table format.
> >
> > Anyway,  we will need to wait for the REST catalog finished before we
> > introduce another languages support , because we can not access the
> iceberg
> > table by invoking the JVM catalog interfaces.
> >
> > On Tue, Jun 7, 2022 at 4:41 AM Micah Kornfield 
> > wrote:
> >
> > > There’s also the question of how useful this would be in practice given
> > >> the complexity of using C++ (or Rust etc) within some of the major
> > >> frameworks.
> > >>
> > >
> > > One place this would be useful is for the Arrow's DataSet API [1].  An
> > > option the Arrow community might be open to is hosting parts of the
> code
> > > there (this is what is done for Apache Parquet C++).  This helps shape
> some
> > > of the answers to other questions posed (ORC and Parquet are already
> in the
> > > Repo, it provides a Filesystem interface, etc).  The project doesn't
> > > currently consume Avro, and I think the preferred approach is to make a
> > > clean room Avro parser.  But I agree this is a non-trivial effort to
> get
> > > underway.
> > >
> > > Another area to consider is compatibility testing.  I think before a
> third
> > > officially supported community library is introduced it would be good
> to
> > > have a compatibility framework in place to make sure implementations
> are
> > > all interpreting the specification correctly.  If there isn't already
> an
> > > effort here, I'd like to start contributing something (probably will
> have
> > > bandwidth sometime place in Q3).
> > >
> > > Thanks,
> > > -Micah
> > >
> > >
> > > [1] https://arrow.apache.org/docs/cpp/dataset.html
> > >
> > > On Sun, Jun 5, 2022 at 11:07 PM Kyle Bendickson 
> wrote:
> > >
> > >> Hi caneGuy,
> > >>
> > >> I personally don’t dislike this idea. I understand the performance
> > >> benefits.
> > >>
> > >> But this would be a huge undertaking for the community. We’d need to
> > >> ensure we had sufficient developer support for reviews (likely one of
> the
> > >> biggest issues), as well as a number of other things. Particularly
> > >> dependencies, package management, etc. We’d also need to scope
> support down
> > >> to specific OS / compilers etc.
> > >>
> > >> We’d also need to be sure we had adequate developer support from a
> wide
> > >> enough range of the community to support the project long term. One
> issue
> > >> in open source is that developers will work on something tangential to
> > &g

Re: 【Feature】Request support for c++ sdk

2022-06-08 Thread OpenInx
As a cloud-native table format standard for the big-data ecosystem,  I
believe supporting multiple languages is the correct direction so that
different languages can connect to the apache iceberg table format.

But I can also get Kyle's point about lacking enough resources (developers
and reviewers) to accomplish this goal.  In my mind, Python, Golang, C++,
and Rust can all be regarded as native language support.  We may just need
to support the Rust SDK, and then all of the other languages can simply
wrap the Rust SDK to access the table format.

Anyway, we will need to wait for the REST catalog to be finished before we
introduce support for other languages, because they cannot access the
iceberg table by invoking the JVM catalog interfaces.

On Tue, Jun 7, 2022 at 4:41 AM Micah Kornfield 
wrote:

> There’s also the question of how useful this would be in practice given
>> the complexity of using C++ (or Rust etc) within some of the major
>> frameworks.
>>
>
> One place this would be useful is for the Arrow's DataSet API [1].  An
> option the Arrow community might be open to is hosting parts of the code
> there (this is what is done for Apache Parquet C++).  This helps shape some
> of the answers to other questions posed (ORC and Parquet are already in the
> Repo, it provides a Filesystem interface, etc).  The project doesn't
> currently consume Avro, and I think the preferred approach is to make a
> clean room Avro parser.  But I agree this is a non-trivial effort to get
> underway.
>
> Another area to consider is compatibility testing.  I think before a third
> officially supported community library is introduced it would be good to
> have a compatibility framework in place to make sure implementations are
> all interpreting the specification correctly.  If there isn't already an
> effort here, I'd like to start contributing something (probably will have
> bandwidth sometime place in Q3).
>
> Thanks,
> -Micah
>
>
> [1] https://arrow.apache.org/docs/cpp/dataset.html
>
> On Sun, Jun 5, 2022 at 11:07 PM Kyle Bendickson  wrote:
>
>> Hi caneGuy,
>>
>> I personally don’t dislike this idea. I understand the performance
>> benefits.
>>
>> But this would be a huge undertaking for the community. We’d need to
>> ensure we had sufficient developer support for reviews (likely one of the
>> biggest issues), as well as a number of other things. Particularly
>> dependencies, package management, etc. We’d also need to scope support down
>> to specific OS / compilers etc.
>>
>> We’d also need to be sure we had adequate developer support from a wide
>> enough range of the community to support the project long term. One issue
>> in open source is that developers will work on something tangential to
>> their project in another repository, but nobody is available to maintain it.
>>
>> There’s also the question of how useful this would be in practice given
>> the complexity of using C++ (or Rust etc) within some of the major
>> frameworks.
>>
>> Again, I’m not opposed to the idea but just trying to be realistic about
>> the realities of such an undertaking. It would need full community support
>> (or at least support from enough community members to be sustainable).
>>
>> If you wanted to make a design doc, the milestones tab in the Iceberg
>> project has some that you might use as reference.
>>
>> *I highly suggest you come to the next community sync and bring this up
>> to the community then.*
>>
>> If you’re not already on the invite list for the monthly community sync,
>> you can get on it by joining the Google group. You’ll receive invites when
>> they go out:
>> https://groups.google.com/g/iceberg-sync
>>
>> Looking forward to seeing you at the next community sync.
>>
>> A design document and/or any prior art would be very helpful as the
>> community sync does discuss many topics (possibly there is existing C++
>> support in StarRocks for Iceberg V1?).
>>
>> Thank you,
>> Kyle Bendickson
>> GitHub: kbendick
>>
>> On Sun, Jun 5, 2022 at 10:44 PM Sam Redai  wrote:
>>
>>> Currently there is no existing effort to develop a C++ package. That
>>> being said I think it would be awesome to have one! If anyone is willing to
>>> start that development effort, I can help with some of the ground work to
>>> kickstart it.
>>>
>>> I would say the first step would be for someone to prepare a high-level
>>> proposal.
>>>
>>> -Sam
>>>
>>> On Sun, Jun 5, 2022 at 11:02 PM 周康  wrote:
>>>
 Hi team
 I am a dev from StarRocks community, and we have supported iceberg v1
 format.
 We are also planning to support v2 format. If there is a C++ package,
 it will be very convenient for our implementation.
 At the same time, other c++ computing engines support v2 format will
 also be faster.

 Do we have plans to support c++ version sdk?
 --
 caneGuy

>>> --
>>>
>>> Sam Redai 
>>>
>>> Developer Advocate  |  Tabular 
>>>
>>> c (267) 226-8606
>>>
>>


Re: Iceberg Delete Compaction Interface Design

2022-04-20 Thread OpenInx
Hi Yufei

There was a proposed PR for this:
https://github.com/apache/iceberg/pull/4522
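
As a rough sketch of how such a merge compaction could be triggered through
the Spark actions API, assuming a release where the bin-pack rewrite honors
a delete-file threshold option (the option name and value below are
assumptions to check against the release in use):

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class MergeDeleteCompaction {
  public static void run(SparkSession spark, Table table) {
    SparkActions.get(spark)
        .rewriteDataFiles(table)
        // rewrite any data file with at least one associated delete file,
        // so the rewritten files absorb the deletes
        .option("delete-file-threshold", "1")
        .execute();
  }
}
```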

On Thu, Apr 21, 2022 at 5:42 AM Yufei Gu  wrote:

> Hi team,
>
> Do we have a PR for this type of delete compaction?
>
>> Merge: the changes specified in delete files are applied to data files
>> and then overwrite the original data file, e.g. merging delete files to
>> data files.
>
>
>
> Yufei
>
>
>
> On Wed, Nov 3, 2021 at 8:29 AM Puneet Zaroo 
> wrote:
>
>> Sounds great. I will look at the PRs.
>> thanks,
>>
>> On Tue, Nov 2, 2021 at 11:35 PM Jack Ye  wrote:
>>
>>>
>>> Yes I am actually arriving at exactly the same conclusion as you just
>>> now. I was focusing on the immediate removal of delete files too much when
>>> writing the doc and lost sight of the fact that we don't need to remove
>>> the deletes once we have the functionality to preserve sequence numbers.
>>>
>>> I just published https://github.com/apache/iceberg/pull/3454 to add the
>>> option for selecting based on deletes in BinPackStrategy this afternoon, I
>>> will add another PR that preserves the sequence number tomorrow.
>>>
>>> -Jack
>>>
>>> On Tue, Nov 2, 2021 at 11:23 PM Puneet Zaroo 
>>> wrote:
>>>
 Thanks for further clarifications, and outlining the detailed steps for
 the delete or 'MERGE' compaction. It seems this compaction is explicitly
 geared towards removing delete files. While that may be useful, I feel for
 CDC tables doing the Bin-pack and Sorting compactions and *removing
 the NEED for reading delete files in downstream queries* quickly
 without conflicts with concurrent CDC updates is more important. This
 guarantees that downstream query performance is decent soon after the data
 has landed in the table.
 The actual delete file removal can happen in a somewhat delayed manner
 as well; as that is now just a storage optimization as those delete files
 are no longer accessed in the query path.

 The above requirements can be achieved if the output of the current
 sorting and bin-pack actions also set the sequence number to the sequence
 number of the snapshot with which the compaction was started. And of course
 the file selection criteria has to be extended to also pick files which
 have > a threshold number of delete files associated with them, in addition
 to the criteria already used (incorrect file size for bin pack or range
 overlap for sort).

 Thanks,
 - Puneet

 On Tue, Nov 2, 2021 at 7:33 PM Jack Ye  wrote:

> > I think even with the custom sequence file numbers on output data
> files; the position delete files have to be deleted; *since position
> deletes also apply on data files with the same sequence number*.
> Also, unless I am missing something, I think the equality delete files
> cannot be deleted at the end of this compaction, as it is really hard to
> figure out if all the impacted data files have been rewritten:
>
> The plan is to always remove both position and equality deletes. Given
> a predicate (e.g. COMPACT table WHERE region='US'), the initial
> implementation will always compact full partitions by (1) look for all
> delete files based on the predicate, (2) get all impacted partitions, (3)
> rewrite all data files in those partitions that has deletes, (4) remove
> those delete files. The algorithm can be improved to a smaller subset of
> files, but currently we mostly rely on Iceberg's scan planning and as you
> said it's hard to figure out if a delete file (especially equality delete)
> covers any additional data files. But we know each delete file only 
> belongs
> to a single partition, which guarantees the removal is safe. (For
> partitioned tables, global deletes will be handled separately and not
> removed unless specifically requested by the user because it requires a
> full table compact, but CDC does not write global deletes anyway)
>
> -Jack
>
>
>
> On Tue, Nov 2, 2021 at 4:10 PM Puneet Zaroo 
> wrote:
>
>> Thanks for the clarifications; and thanks for pulling together the
>> documentation for the row-level delete functionality separately; as that
>> will be very helpful.
>> I think we are in agreement on most points. I just want to reiterate
>> my understanding of the merge compaction behavior to make sure we are on
>> the same page.
>>
>> The output table of a Flink CDC pipeline will in a lot of cases have
>> small files with unsorted data; so doing the bin-pack and sorting
>> compactions is also important for those tables (and obviously being able 
>> to
>> do so without conflicts with incoming data is also important). If the
>> existing bin-pack and sort compaction actions are enhanced with
>>
>> 1) writing the output data files with the sequence number of the
>> snapshot with which the compaction was started with *AND *

Re: Welcome Szehon Ho as a committer!

2022-03-11 Thread OpenInx
Congrats Szehon!

On Sat, Mar 12, 2022 at 7:55 AM Steve Zhang 
wrote:

> Congratulations Szehon, Well done!
>
> Thanks,
> Steve Zhang
>
>
>
> On Mar 11, 2022, at 3:51 PM, Jack Ye  wrote:
>
> Congratulations Szehon!!
>
> -Jack
>
> On Fri, Mar 11, 2022 at 3:45 PM Wing Yew Poon 
> wrote:
>
>> Congratulations Szehon!
>>
>>
>> On Fri, Mar 11, 2022 at 3:42 PM Sam Redai  wrote:
>>
>>> Congrats Szehon!
>>>
>>> On Fri, Mar 11, 2022 at 6:41 PM Yufei Gu  wrote:
>>>
 Congratulations Szehon!
 Best,

 Yufei

 `This is not a contribution`


 On Fri, Mar 11, 2022 at 3:36 PM Ryan Blue  wrote:

> Congratulations Szehon!
>
> Sorry I accidentally preempted this announcement with the board report!
>
> On Fri, Mar 11, 2022 at 3:32 PM Anton Okolnychyi <
> aokolnyc...@apple.com.invalid> wrote:
>
>> Hey everyone,
>>
>> I would like to welcome Szehon Ho as a new committer to the project!
>>
>> Thanks for all your work, Szehon!
>>
>> - Anton
>>
>
>
> --
> Ryan Blue
> Tabular
>

>


Review Request

2022-03-09 Thread OpenInx
Hi iceberg dev

I've recently revisited the flink write path to use the newly introduced
writers (the partition-specific writers).  All future performance &
stability optimizations will be made on top of the revisited flink write
path.  I've just published the PR here:
https://github.com/apache/iceberg/pull/4264

Could anybody help review this PR?  Thanks in advance.


Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-07 Thread OpenInx
Thanks Dongjoon & Yiqun for the quick PR for adding the `estimateMemory`
API.

Also thanks Yiqun & Owen for your points; I think you are right.  So
a more accurate estimation method may be to multiply batch.size by the
average width of the data types, and then multiply by the compression
ratio, which is usually an empirical value.
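
Spelled out, that estimate is just three additive terms. A small sketch of
the arithmetic (all inputs here are assumptions supplied by the caller;
only the tree writer's estimate comes from the ORC API):

```java
public class OrcSizeEstimate {
  // flushedStripeBytes: offset + length of the writer's last completed stripe
  // treeWriterMemoryBytes: Writer#estimateMemory, data submitted but unflushed
  // batchRowCount * avgFieldWidthBytes * compressionRatio: rough byte size of
  // rows still buffered in the unsubmitted batch
  static long estimateOrcFileBytes(long flushedStripeBytes,
                                   long treeWriterMemoryBytes,
                                   int batchRowCount,
                                   double avgFieldWidthBytes,
                                   double compressionRatio) {
    long unsubmittedBatchBytes =
        (long) (batchRowCount * avgFieldWidthBytes * compressionRatio);
    return flushedStripeBytes + treeWriterMemoryBytes + unsubmittedBatchBytes;
  }
}
```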



On Sat, Mar 5, 2022 at 2:28 AM Owen O'Malley  wrote:

> At the stripe boundaries, the bytes on disk statistics are accurate. A
> stripe that is in flight, is going to be an estimate, because the
> dictionaries can't be compressed until the stripe is flushed. The memory
> usage will be a significant over estimate, because it includes buffers that
> are allocated, but not used yet.
>
> .. Owen
>
> On Fri, Mar 4, 2022 at 5:23 PM Dongjoon Hyun  wrote:
>
>> The following is merged for Apache ORC 1.7.4.
>>
>> ORC-1123 Add `estimateMemory` method for writer
>>
>> According to the Apache ORC milestone, it will be released on May 15th.
>>
>> https://github.com/apache/orc/milestones
>>
>> Bests,
>> Dongjoon.
>>
>> On 2022/03/04 13:11:15 Yiqun Zhang wrote:
>> > Hi Openinx
>> >
>> > Thank you for initiating this discussion. I think we can get the
>> `TypeDescription` from the writer and in the `TypeDescription` we know
>> which types and more precisely the maximum length of the varchar/char. This
>> will help us to estimate the average width.
>> >
>> > Also, I agree with your suggestion, I will make a PR later to add the
>> `estimateMemory` public method for Writer.
>> >
>> > On 2022/03/04 04:01:04 OpenInx wrote:
>> > > Hi Iceberg dev
>> > >
>> > > As we all know,  in our current apache iceberg write path,  the ORC
>> file
>> > > writer cannot just roll over to a new file once its byte size reaches
>> the
>> > > expected threshold.  The core reason that we don't support this
>> before is:
>> > >   The lack of correct approach to estimate the byte size from an
>> unclosed
>> > > ORC writer.
>> > >
>> > > In this PR: https://github.com/apache/iceberg/pull/3784,  hiliwei is
>> trying
>> > > to propose an estimate approach to fix this fundamentally (Also
>> enabled all
>> > > those ORC writer unit tests that we disabled intentionally before).
>> > >
>> > > The approach is:  If a file is still unclosed , let's estimate its
>> size in
>> > > three steps ( PR:
>> > >
>> https://github.com/apache/iceberg/pull/3784/files#diff-e7fcc622bb5551f5158e35bd0e929e6eeec73717d1a01465eaa691ed098af3c0R107
>> > > )
>> > >
>> > > 1. Size of data that has been written to stripes. The value is obtained
>> by
>> > > summing the offset and length of the last stripe of the writer.
>> > > 2. Size of data that has been submitted to the writer but has not been
>> > > written to the stripe. When creating OrcFileAppender, treeWriter is
>> > > obtained through reflection, and uses its estimateMemory to estimate
>> how
>> > > much memory is being used.
>> > > 3. Data that has not been submitted to the writer, that is, the size
>> of the
>> > > buffer. The maximum default value of the buffer is used here.
>> > >
>> > > My feeling is:
>> > >
>> > > For the file-persisted bytes, I think using the last stripe's offset
>> plus
>> > > its length should be correct. For the memory encoded batch vector ,
>> the
>> > > TreeWriter#estimateMemory should be okay.
>> > > But for the batch vector whose rows did not flush to encoded memory,
>> using
>> > > the batch.size shouldn't be correct. Because the rows can be any data
>> type,
>> > > such as Integer, Long, Timestamp, String etc. As their widths are not
>> the
>> > > same, I think we may need to use an average width minus the batch.size
>> > > (which is row count actually).
>> > >
>> > > Another thing is about the `TreeWriter#estimateMemory` method,  The
>> current
>> > > `org.apache.orc.Writer`  don't expose the `TreeWriter` field or
>> > > `estimateMemory` method to public,  I will suggest to publish a PR to
>> > > apache ORC project to expose those interfaces in
>> `org.apache.orc.Writer` (
>> > > see: https://github.com/apache/iceberg/pull/3784/files#r819238427 )
>> > >
>> > > I'd like to invite the iceberg dev to evaluate the current approach.
>> Is
>> > > there any other concern from the ORC experts' side ?
>> > >
>> > > Thanks.
>> > >
>> >
>>
>


Re: [DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-03 Thread OpenInx
> As their widths are not the same, I think we may need to use an average
width minus the batch.size (which is row count actually).

@Kyle, sorry I mistyped the word before.  I meant "need an average width
multiplied by the batch.size".

On Fri, Mar 4, 2022 at 1:29 PM liwei li  wrote:

> Thanks to openinx for opening this discussion.
>
> One thing to note, the current approach faces a problem, because of some
> optimization mechanisms, when writing a large amount of duplicate data,
> there will be some deviation between the estimated and the actual size.
> However, when cached data is flushed (the amount or size exceeds the
> threshold), the estimate is revised. That's one of the reasons I changed
> this
> https://github.com/apache/iceberg/pull/3784#discussion_r819229491
>
>
> Kyle Bendickson wrote on Fri, Mar 4, 2022 at 12:55:
>
>> Hi Openinx.
>>
>> Thanks for bringing this to our attention. And many thanks to hiliwei for
>> their willingness to tackle big problems and little problems.
>>
>> I wanted to say that I think most anything that’s relatively close would
>> be better than the current situation most likely (where the feature is
>> disabled entirely).
>>
>> Thank you for your succinct summary of the situation. I tagged Dongjoon
>> Hyun, one of the ORC VPs, in the PR and will reach out to him as well.
>>
>> I am inclined to agree that we need to consider the width of the types,
>> as fields like binary or even string can be potentially quite wide compared
>> to int.
>>
>> I like your suggestion to use an “average width” when used with the batch
>> size, though subtracting batch size from average width seems slightly off…
>> I would think maybe the average width needs to be multiplied or divided by
>> the batch size. Possibly I’m not understanding fully.
>>
>> How would you propose to get an “average width”, for use with the data
>> that’s not been flushed to disk yet? And would it be an average width based
>> on the actually observed data or just on the types?
>>
>> Again, I think that any approach is better than none, and we can iterate
>> on the statistics collection. But I am inclined to agree, points (1) and
>> (2) seem ok. And it would be beneficial to consider the points raised
>> regarding (3).
>>
>> Thanks for bringing this to the dev list.
>>
>> And many thanks to hiliwei for their work so far!
>>
>> - Kyle
>>
>> On Thu, Mar 3, 2022 at 8:01 PM OpenInx  wrote:
>>
>>> Hi Iceberg dev
>>>
>>> As we all know, in our current Apache Iceberg write path, the ORC file
>>> writer cannot just roll over to a new file once its byte size reaches the
>>> expected threshold. The core reason we haven't supported this before is
>>> the lack of a correct approach to estimating the byte size of an unclosed
>>> ORC writer.
>>>
>>> In this PR: https://github.com/apache/iceberg/pull/3784, hiliwei is
>>> proposing an estimation approach to fix this fundamentally (it also
>>> enables all those ORC writer unit tests that we intentionally disabled
>>> before).
>>>
>>> The approach is: if a file is still unclosed, let's estimate its size
>>> in three steps (PR:
>>> https://github.com/apache/iceberg/pull/3784/files#diff-e7fcc622bb5551f5158e35bd0e929e6eeec73717d1a01465eaa691ed098af3c0R107
>>> )
>>>
>>> 1. Size of data that has been written to stripes. The value is obtained
>>> by summing the offset and length of the writer's last stripe.
>>> 2. Size of data that has been submitted to the writer but has not been
>>> written to a stripe. When creating OrcFileAppender, the treeWriter is
>>> obtained through reflection, and its estimateMemory method is used to
>>> estimate how much memory is being used.
>>> 3. Data that has not been submitted to the writer, that is, the size of
>>> the buffer. The maximum default value of the buffer is used here.
>>>
>>> My feeling is:
>>>
>>> For the file-persisted bytes, I think using the last stripe's offset
>>> plus its length should be correct. For the memory-encoded batch vector,
>>> TreeWriter#estimateMemory should be okay.
>>> But for the batch vector whose rows have not been flushed to encoded
>>> memory, using the batch.size alone isn't correct, because the rows can
>>> be of any data type, such as Integer, Long, Timestamp, String, etc. As
>>> their widths are not the same, I think we may need to use an average
>>> width minus the batch.size (which is row count actually).
>>>
>>> Another thing is about the `TreeWriter#estimateMemory` method. The
>>> current `org.apache.orc.Writer` doesn't expose the `TreeWriter` field or
>>> the `estimateMemory` method to the public, so I suggest publishing a PR
>>> to the Apache ORC project to expose those interfaces in
>>> `org.apache.orc.Writer` (
>>> see: https://github.com/apache/iceberg/pull/3784/files#r819238427 )
>>>
>>> I'd like to invite the Iceberg devs to evaluate the current approach.
>>> Is there any other concern from the ORC experts' side?
>>>
>>> Thanks.
>>>
>>
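
On Kyle's "average width" question above, one simple possibility is to
start from fixed widths for the primitive categories and fall back to an
assumed (or observed running) average for variable-width types. A minimal
sketch against the ORC TypeDescription API; the width values are
illustrative guesses, not measurements:

import org.apache.orc.TypeDescription;

class WidthEstimates {
  // Illustrative per-category widths; STRING/BINARY and friends would
  // ideally use a running average of the actually observed data instead.
  static int assumedWidthBytes(TypeDescription.Category category) {
    switch (category) {
      case BOOLEAN:
      case BYTE:
        return 1;
      case SHORT:
        return 2;
      case INT:
      case FLOAT:
      case DATE:
        return 4;
      case LONG:
      case DOUBLE:
      case TIMESTAMP:
        return 8;
      default:
        return 16; // STRING, BINARY, VARCHAR, DECIMAL, ...: assumed average
    }
  }
}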


[DISCUSS] The correct approach to estimate the byte size for an unclosed ORC writer.

2022-03-03 Thread OpenInx
Hi Iceberg dev

As we all know, in our current Apache Iceberg write path, the ORC file
writer cannot just roll over to a new file once its byte size reaches the
expected threshold. The core reason we haven't supported this before is
the lack of a correct approach to estimating the byte size of an unclosed
ORC writer.

In this PR: https://github.com/apache/iceberg/pull/3784, hiliwei is trying
to propose an estimation approach to fix this fundamentally (it also
enables all those ORC writer unit tests that we intentionally disabled
before).

The approach is: if a file is still unclosed, let's estimate its size in
three steps (PR:
https://github.com/apache/iceberg/pull/3784/files#diff-e7fcc622bb5551f5158e35bd0e929e6eeec73717d1a01465eaa691ed098af3c0R107
)

1. Size of data that has been written to stripes. The value is obtained by
summing the offset and length of the writer's last stripe.
2. Size of data that has been submitted to the writer but has not been
written to a stripe. When creating OrcFileAppender, the treeWriter is
obtained through reflection, and its estimateMemory method is used to
estimate how much memory is being used.
3. Data that has not been submitted to the writer, that is, the size of
the buffer. The maximum default value of the buffer is used here.

My feeling is:

For the file-persisted bytes, I think using the last stripe's offset plus
its length should be correct. For the memory-encoded batch vector,
TreeWriter#estimateMemory should be okay.
But for the batch vector whose rows have not been flushed to encoded
memory, using the batch.size alone isn't correct, because the rows can be
of any data type, such as Integer, Long, Timestamp, String, etc. As their
widths are not the same, I think we may need to use an average width minus
the batch.size (which is row count actually).

Another thing is about the `TreeWriter#estimateMemory` method. The current
`org.apache.orc.Writer` doesn't expose the `TreeWriter` field or the
`estimateMemory` method to the public, so I suggest publishing a PR to the
Apache ORC project to expose those interfaces in `org.apache.orc.Writer` (
see: https://github.com/apache/iceberg/pull/3784/files#r819238427 )

I'd like to invite the Iceberg devs to evaluate the current approach. Is
there any other concern from the ORC experts' side?

Thanks.
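
A sketch of what that exposure might look like on the ORC side (purely
illustrative; the actual change would go through the Apache ORC project's
own review):

import java.io.Closeable;

// Hypothetical shape of the proposed addition to org.apache.orc.Writer,
// so that callers no longer need reflection to reach
// TreeWriter#estimateMemory.
public interface Writer extends Closeable {
  /** Estimated memory currently held by the writer for buffered data. */
  long estimateMemory();
}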


Re: Review request

2022-03-02 Thread OpenInx
Thanks Peter for the great work. Just added my comments.

On Wed, Mar 2, 2022 at 4:20 PM Peter Vary 
wrote:

> Hi Team,
>
> I have a PR (https://github.com/apache/iceberg/pull/4218) waiting for
> review where, with basically a one-liner change, we can improve the
> performance of the GenericReader classes by 10–20%. This one-liner is
> needed in 3 places. The other part of the PR is only the added tests.
>
> I would love to see some committer reviews on this if your time permits.
>
> Thanks,
> Peter
>


Re: [DISCUSS] Align the spark runtime artifact names among spark2.4, spark3.0, spark3.1 and spark3.2

2022-02-24 Thread OpenInx
So we basically agree to rename the Spark artifacts. Is there any other
concern about this PR: https://github.com/apache/iceberg/pull/4158/ ?

On Wed, Feb 23, 2022 at 1:48 AM Ryan Blue  wrote:

> I initially supported not renaming for the reason that Jeff raised, but
> now I'm more convinced by Kyle's argument. This is confusing and it isn't
> that big of a problem to use a different Jar. +1 to renaming.
>
> On Sun, Feb 20, 2022 at 10:57 PM Yufei Gu  wrote:
>
>> Agreed with Kyle. An artifact name for Spark 3.0 like
>> iceberg-spark-runtime-3.0_2.12-0.13.1.jar is more accurate and
>> consistent, and less confusing for users.
>>
>> On Sun, Feb 20, 2022 at 10:47 PM Kyle Bendickson  wrote:
>>
>>> Thanks for bringing this up Jeff!
>>>
>>> Normally I agree that it's not good practice to change an artifact name.
>>> However, in this case, the artifact has changed already. The
>>> "spark3-runtime" jar used to cover all versions of Spark 3 (at the time,
>>> Spark 3.0 and 3.1). It no longer does, as it's only tested / used with
>>> Spark 3.0.
>>>
>>> I encounter many users who have upgraded to newer versions of Spark but
>>> have not upgraded the artifact to the new per-Spark-version naming
>>> scheme, since "spark3-runtime" sounds like it encompasses all versions.
>>> They then hit subtle bugs, and it's not a great user experience to debug
>>> an upgrade that way.
>>>
>>> These users are, however, updating the Iceberg artifact to the new
>>> versions.
>>>
>>> So I think in this case, breaking the naming has benefits. When users go
>>> to upgrade as new Iceberg versions are released and their dependency is
>>> not found, they will hopefully check Maven and see the new naming
>>> convention / artifacts.
>>>
>>> So I also support option 2, with names that include the Spark and Scala
>>> versions. Otherwise, we will continue to see people keep the old
>>> "spark3-runtime" as they upgrade Spark versions and encounter subtle
>>> errors (class not found, wrong type signatures due to version mismatch).
>>>
>>> Users eventually have to upgrade their pom if/when they upgrade Spark,
>>> due to incompatibility. This way at least the breakage will be loud, as
>>> there won't be a new Iceberg version under the old artifact name.
>>>
>>> Is it possible to mark to the old spark3-runtime / spark-runtime as
>>> deprecated or otherwise point to the new artifacts in Maven?
>>>
>>> - Kyle
>>>
>>> On Sun, Feb 20, 2022 at 9:41 PM Jeff Zhang  wrote:
>>>
>>>> I don't think it is best practice to just change the artifact name of
>>>> published jars, unless we publish a new version with the new naming
>>>> convention.
>>>>
>>>> On Mon, Feb 21, 2022 at 12:36 PM Jack Ye  wrote:
>>>>
>>>>> I think option 2 is ideal, but I don't know if there is any hard
>>>>> requirement from ASF/Maven Central side for us to keep backwards
>>>>> compatibility of package names published in maven. If there is a
>>>>> requirement then we cannot change it.
>>>>>
>>>>> As a mitigation, I stated in
>>>>> https://iceberg.apache.org/multi-engine-support that Spark 2.4 and
>>>>> 3.0 jar names do not follow the naming convention of newer versions for
>>>>> backwards compatibility.
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>> On Sun, Feb 20, 2022 at 7:03 PM OpenInx  wrote:
>>>>>
>>>>>> Hi everyone
>>>>>>
>>>>>> The current Spark 2.4 and Spark 3.0 builds have the following
>>>>>> unaligned runtime artifact names:
>>>>>>
>>>>>> # Spark 2.4
>>>>>> iceberg-spark-runtime-0.13.1.jar
>>>>>> # Spark 3.0
>>>>>> iceberg-spark3-runtime-0.13.1.jar
>>>>>> # Spark 3.1
>>>>>> iceberg-spark-runtime-3.1_2.12-0.13.1.jar
>>>>>> # Spark 3.2
>>>>>> iceberg-spark-runtime-3.2_2.12-0.13.1.jar
>>>>>>
>>>>>> From the Spark 3.1 and Spark 3.2 runtime artifact names, we can
>>>>>> easily recognize:
>>>>>> 1. What Spark major version the runtime jar is attached to
>>>>>> 2. What Scala version the runtime jar is compiled with
>>>>>>
>>>>>> But for Spark 3.0 and Spark 2.4, it's not easy to un

[DISCUSS] Align the spark runtime artifact names among spark2.4, spark3.0, spark3.1 and spark3.2

2022-02-20 Thread OpenInx
Hi everyone

The current Spark 2.4 and Spark 3.0 builds have the following unaligned
runtime artifact names:

# Spark 2.4
iceberg-spark-runtime-0.13.1.jar
# Spark 3.0
iceberg-spark3-runtime-0.13.1.jar
# Spark 3.1
iceberg-spark-runtime-3.1_2.12-0.13.1.jar
# Spark 3.2
iceberg-spark-runtime-3.2_2.12-0.13.1.jar

From the Spark 3.1 and Spark 3.2 runtime artifact names, we can easily
recognize:
1. What Spark major version the runtime jar is attached to
2. What Scala version the runtime jar is compiled with

But for Spark 3.0 and Spark 2.4, it's not easy to read off the above
information. I think we kept those legacy names because they were
introduced in older Iceberg releases and we wanted to avoid changing the
modules that users depend on, but they are indeed causing confusion for
new community users.

In general, we have two options:

Option#1:  Keep the current artifact names. That means Spark 2.4 & Spark
3.0 will always use the iceberg-spark-runtime-<version>.jar and
iceberg-spark3-runtime-<version>.jar names until they get retired from the
Apache Iceberg official repo.
Option#2:  Change the Spark 2.4 & Spark 3.0 artifact names to the generic
name format:
iceberg-spark-runtime-<spark-version>_<scala-version>-<iceberg-version>.jar.
That gives all the Spark versions a consistent name format.

Personally, I'd prefer option#2 because it is friendlier for new community
users (although it will require existing users to change their pom.xml
when they upgrade to the new version).

What is your preference?

Reference:
1.  I created a PR to change the artifact names, and we had a few
discussions there: https://github.com/apache/iceberg/pull/4158
2.  https://github.com/apache/iceberg-docs/pull/27#discussion_r800297155


Re: [DISCUSS] Iceberg roadmap

2022-02-17 Thread OpenInx
Update:

As the Dell EMC EcsFileIO has been merged into apache iceberg
official repo, so I think it's okay to get this project from roadmap closed
now: https://github.com/apache/iceberg/projects/22

Thanks.

On Wed, Nov 10, 2021 at 10:22 AM Zhao Chun  wrote:

> Thanks Ryan.
> We will keep a close eye on what is happening in the iceberg community and
> seek help when necessary.
>
> Thanks,
> Zhao Chun
>
>
> On Wed, Nov 10, 2021 at 8:54 AM, Ryan Blue wrote:
>
>> Thanks, Zhao. I think those are great ways to work together. Let us know
>> how we can help you make StarRocks successful with Iceberg as its data
>> format. We're always happy to help people understand how Iceberg works and
>> improve our docs on how to use it.
>>
>> Ryan
>>
>> On Mon, Nov 8, 2021 at 8:17 PM Zhao Chun  wrote:
>>
>>> I feel that Ryan's response exemplifies the generosity of an Apache
>>> project creator,
>>> a quality that has touched and benefited us. We look forward to
>>> contributing
>>> further to the Apache project in the future.
>>> As for the need for an issue to track progress, I don't think so for now.
>>> At the moment the main development work is done in the StarRocks
>>> repository.
>>> As for further cooperation in the future, I think there are several
>>> aspects.
>>> 1. StarRocks will be trying to support Iceberg.
>>> I think this will help StarRocks to re-examine how it integrates with
>>> the lakehouse system
>>> and we will be happy to feed back to the Apache Iceberg community the
>>> issues and benefits
>>> we encounter during the integration process.
>>> This will also validate the versatility of the iceberg project to
>>> support more query engines.
>>> I think this project will benefit both projects.
>>> 2. In the future, we will share some of our best practices for iceberg
>>> and StarRocks integration in a blog or talk.
>>> If the Apache Iceberg project feels that these blogs or talks would be
>>> beneficial to the Apache iceberg community,
>>> please consider linking our subsequent blogs or talks to the apache
>>> iceberg website blog.
>>> The Iceberg community can, of course, not link if they feel it is
>>> inappropriate.
>>> 3. we expect to contribute to the Apache Iceberg community under the
>>> Apache License V2.
>>>
>>> Thanks,
>>> Zhao Chun
>>>
>>>
>>> On Tue, Nov 9, 2021 at 3:05 AM, Ryan Blue wrote:
>>>
>>>> I think it is great to see another processing engine adding support for
>>>> Apache Iceberg, and I do look forward to collaborating with the StarRocks
>>>> community in the future.
>>>>
>>>> I'm not entirely sure what that collaboration would look like just yet
>>>> though. For most processing engines, it is people joining the Apache
>>>> Iceberg community. No matter what the license of the downstream project, we
>>>> always welcome more people contributing here!
>>>>
>>>> As for opening a project in our tracker, I'm not sure it makes sense to
>>>> do that just yet. As far as I know there aren't any issues to track there.
>>>> And would the StarRocks community find it helpful?
>>>>
>>>> On Mon, Nov 8, 2021 at 12:14 AM Zhao Chun  wrote:
>>>>
>>>>> Thanks to @OpenInx for mentioning StarRocks in the iceberg community.
>>>>>
>>>>> I'm from the StarRocks community.
>>>>>
>>>>> StarRocks is based on the Apache Doris project.
>>>>> It has been in development internally for almost two years and is
>>>>> currently used by hundreds of companies.
>>>>> It was open-sourced just 2 months ago.
>>>>>
>>>>> Iceberg is a great project that makes huge datasets analysis more
>>>>> convenient.
>>>>> The StarRocks community is planning to support the iceberg engine.
>>>>> This will provide StarRocks users with the ability to analyze data in
>>>>> iceberg.
>>>>>
>>>>> Regarding the license, StarRocks' ELv2 will not affect our
>>>>> contribution to the iceberg community under the Apache License V2.
>>>>>
>>>>> We are also looking forward to receiving help from the iceberg
>>>>> community and will be contributing back to the iceberg community.
>>>>>
>>>>> Thanks,
>>>>> Zhao Chun
>>>>>
>>>>>
>>>>

Re: New Versioned Iceberg Documentation Site

2022-02-06 Thread OpenInx
The new site looks great to me, thanks all for the work !

One unrelated thing: I remember we had a discussion about adding a new
page to the doc site to collect all the design docs (Google docs, GitHub
issues, etc.). Is there any progress on this? Someone who contacted me
raised the same question before, and I could not find the page in the end,
so I am bringing it up on this mailing list.

Thanks.

On Sun, Feb 6, 2022 at 10:18 AM Sam Redai  wrote:

> Thanks for all of the comments and feedback!
>
> @asingh yes I think once we release the new site with versioned docs, we
> can start to add things like docker images and more examples. I have some
> ideas there but PRs will definitely be welcome! Also I agree, a getting
> started page with docker examples would be a good landing-page item with a
> Learn More button that takes you directly to a page with more details.
>
> @omar the font on the Navbar is definitely smaller, mainly to fit all of
> the items. If we go with a drop down menu instead, we can increase the font
> size.
>
> Jack and I have been putting the finishing touches on the site. I’m
> looking forward to its release this coming week!
>
> -Sam
>
> On Wed, Feb 2, 2022 at 12:34 PM Ashish Singh  wrote:
>
>> The website is really great. Thanks Sam!
>>
>> Is there any plan to link docker images for users to download and try out
>> the examples listed on the homepage? If yes, should we maybe add it
>> explicitly on the homepage (somewhat at bottom after all the use-cases)?
>> Doesn't have to be in the first version.
>>
>
>>
>> On Tue, Feb 1, 2022 at 1:50 PM Omar Al-Safi  wrote:
>>
>>> Also, wouldn't it make sense to add a section to reach the
>>> integration page directly from the landing page (let's say a drop-down
>>> menu) in the top section?
>>>
>>> Regards,
>>>
>>> On Tue, Feb 1, 2022 at 10:46 PM Omar Al-Safi  wrote:
>>>
 Looking good, thanks Sam!

 However, is it me or is the top section a bit smaller in comparison to
 the overall page?

 [image: Screenshot 2022-02-01 at 22.40.49.png]

 Regards,
 Omar

 On Tue, Feb 1, 2022 at 9:09 PM Kyle Bendickson  wrote:

> +1 from me. This looks great. Thank you for all your hard work, Sam!
>
> On Tue, Feb 1, 2022 at 10:33 AM Jack Ye  wrote:
>
>> +1, amazing website! And now that the website repo is separate, we can
>> continue to iterate and deploy quickly without affecting the main repo,
>> so it doesn't need to be 100% perfect as of now.
>>
>> I will update the 0.13 release note against the new website and we
>> can announce them together.
>>
>> Best,
>> Jack Ye
>>
>> On Tue, Feb 1, 2022 at 8:26 AM Ryan Blue  wrote:
>>
>>> Good catch. Looks like the anchor links on the spec page are broken.
>>> We'll have to get those fixed.
>>>
>>> I think we should move forward with the update and fix these as we
>>> come across them. It's inevitable that we'll have some broken things in
>>> a big change, but we don't want to block this improvement on being 100%
>>> perfect.
>>>
>>> On Tue, Feb 1, 2022 at 1:21 AM Ajantha Bhat 
>>> wrote:
>>>
 Nice looking website.

 Is the shared link the final version ? I couldn't see the markdown
 anchor tag inside https://iceberg.redai.dev/spec/
 It will be useful to have that for sharing specific parts of the
 spec.

 Also some pages are in light theme and some are in dark theme.
 Better to have a unified theme.

 +1 for versioning and overall work.

 Thanks,
 Ajantha

 On Tue, Feb 1, 2022 at 1:47 PM Eduard Tudenhoefner <
 edu...@dremio.com> wrote:

> +1 on the procedure and the new site looks amazing
>
> On Tue, Feb 1, 2022 at 3:38 AM Ryan Blue  wrote:
>
>> +1 from me. I think the new site looks great and it is a big
>> improvement to have version-specific docs. Thanks for all your work on
>> this, Sam!
>>
>> On Mon, Jan 31, 2022 at 5:48 PM Sam Redai  wrote:
>>
>>> Hey Everyone,
>>>
>>> With 0.13.0's approval for release, I think this would be a good
>>> time to have a discussion around the proposed versioned documentation
>>> site, powered by Hugo. The site is ready to be released and the source
>>> code for the site can be found in the apache/iceberg-docs repository:
>>> https://github.com/apache/iceberg-docs.
>>>
>>> In order for everyone to see a dev version of the site live,
>>> I've deployed it temporarily to: https://iceberg.redai.dev
>>>
>>> The markdown files will remain in the apache/iceberg repository
>>> and will represent the latest unreleased documentation. PRs for 
>>> changes 

Re: Vendor integration strategy

2021-12-13 Thread OpenInx
Thanks Jack for the feedback !

Yes, I agree that the ideal approach is option #3, as you said. The core
issue with a vendor bundled runtime jar is that most vendors currently
provide a Maven dependency configuration but may not provide a bundled
runtime jar, because a Maven dependency configuration solves the more
general problem: external users can selectively choose dependencies, or
package their own runtime jars. I believe that if we were the SDK
publisher for a cloud vendor, we would prioritize publishing the Maven
configuration over directly publishing a bundled runtime jar.

So in the short term, although option #3 is the ideal way, the practical
outcome may actually be option #2. However, I think we have taken a good
first step (integrating the Iceberg FileIO), and I can try to push Aliyun
to release their own bundled jars to simplify the experience for Iceberg
users accessing the Aliyun OSS service.

Thanks.

On Tue, Dec 14, 2021 at 9:13 AM Jack Ye  wrote:

> Thank you Openinx for preparing all these PRs and the vote options!
>
> In the community sync, we also talked about not including any new vendor
> integration modules in engine runtimes. In this approach, vendor
> dependencies do not need to be listed in the provided (compile only) scope.
> Vendors will publish their own runtime but outside the Apache Iceberg
> project if they want to have a runtime jar. We can list that as option #3.
>
> I would vote for #2,  because:
> 1. It is the current integration pattern of AWS and Nessie. I think it's a
> consistent path forward, and does not make Nessie and AWS special cases.
> 2. It only adds the few classes implemented inside Iceberg so it does not
> inflate the runtime jar with vendor dependencies.
>
> Between option #1 and #3, the only difference is if we offer a runtime jar
> or not. My understanding is that currently we have some people against it
> due to legal liability. I think I totally understand what openinx advocates
> for a simple user experience, but from the consistency perspective, that
> means all vendors will publish a runtime, and we don't know if any of those
> would have licensing issues in the future, so I would be a bit hesitant to
> go with #1.
>
> Between option #2 and #3, we will need to specify a list of jars when
> using an engine runtime anyway. I think it's a bit more beneficial to
> specify fewer jars by just bundling all the Iceberg integration classes in
> the runtimes, so users only need to consider what vendor dependencies are
> missing in their execution environment.
>
> We are currently also adding a module for Google Cloud, it would be great
> if Daniel could provide some opinions here.
> https://github.com/apache/iceberg/pull/3711
>
> -Jack
>
> On Sun, Dec 12, 2021 at 11:58 PM OpenInx  wrote:
>
>> As the release 0.13.0 is coming,   I don't hope this bundled issue blocks
>> the 0.13.0 release progress. So I prepared two options for iceberg devs to
>> vote:
>>
>> Option#1:  Bundled the iceberg-aliyun and all the dependencies into a
>> single bundled jar, named iceberg-aliyun-runtime.
>>
>> The PR is:  https://github.com/apache/iceberg/pull/3684
>> The usage is here: https://github.com/apache/iceberg/pull/3686/files
>>
>> Option#2:  Add only the iceberg-aliyun (without aliyun-oss sdk deps) into
>> flink/spark/hive runtime jars, and people need to load those aliyun-oss sdk
>> externally by hand.
>>
>> The PR is: https://github.com/apache/iceberg/pull/3725
>> The usage example is here:
>> https://github.com/apache/iceberg/pull/3725#issue-800973927
>>
>> We can vote for option#1 or option#2.
>>
>> Any feedback is welcome, thanks in advance.
>>
>> On Thu, Dec 9, 2021 at 8:29 PM OpenInx  wrote:
>>
>>> Thanks Jack for bringing this up, and thanks Ryan for sharing your
>>> point.
>>>
>>> > Getting a minimal set of transitive dependencies, relocating the
>>> classes that they pull in to avoid conflicts, and tracking licensing is a
>>> huge amount of work that has so far been done or validated by a very small
>>> set of people.
>>>
>>> I did the iceberg-flink-runtime packaging work before. At that time, I
>>> needed to search all the dependencies of that module, pick out all the
>>> licenses & notices, and relocate all the common packages. Yes, it's a huge
>>> amount of work. But I think great open source software should solve those
>>> abstract common problems, recalling that we were discussing whether we need
>>> to support multiple versions of the same engine in apache iceberg. I
>>> remember that Ryan said at the time that if we do not solve this problem in
>>> the off

Re: Vendor integration strategy

2021-12-12 Thread OpenInx
As the release 0.13.0 is coming,   I don't hope this bundled issue blocks
the 0.13.0 release progress. So I prepared two options for iceberg devs to
vote:

Option#1:  Bundled the iceberg-aliyun and all the dependencies into a
single bundled jar, named iceberg-aliyun-runtime.

The PR is:  https://github.com/apache/iceberg/pull/3684
The usage is here: https://github.com/apache/iceberg/pull/3686/files

Option#2:  Add only the iceberg-aliyun (without aliyun-oss sdk deps) into
flink/spark/hive runtime jars, and people need to load those aliyun-oss sdk
externally by hand.

The PR is: https://github.com/apache/iceberg/pull/3725
The usage example is here:
https://github.com/apache/iceberg/pull/3725#issue-800973927

We can vote for option#1 or option#2.

Any feedback is welcome, thanks in advance.
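
For context on what either option means for users: the catalog-side
configuration is the same in both cases; only how the jars reach the
classpath differs. A minimal sketch, assuming the iceberg-aliyun module's
OSSFileIO class and the standard io-impl catalog property:

import java.util.HashMap;
import java.util.Map;

class AliyunOssCatalogProps {
  // Minimal sketch: catalog properties selecting the Aliyun OSS FileIO.
  // Under option#2 the Aliyun OSS SDK jars must be added to the classpath
  // by hand; under option#1 they would ship inside iceberg-aliyun-runtime.
  static Map<String, String> catalogProperties() {
    Map<String, String> props = new HashMap<>();
    props.put("warehouse", "oss://my-bucket/warehouse"); // example bucket
    props.put("io-impl", "org.apache.iceberg.aliyun.oss.OSSFileIO");
    return props;
  }
}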

On Thu, Dec 9, 2021 at 8:29 PM OpenInx  wrote:

> Thanks Jack for bringing this up, and thanks Ryan for sharing your point.
>
> > Getting a minimal set of transitive dependencies, relocating the classes
> that they pull in to avoid conflicts, and tracking licensing is a huge
> amount of work that has so far been done or validated by a very small set
> of people.
>
> I did the iceberg-flink-runtime packaging work before. At that time, I
> needed to search all the dependencies of that module, pick out all the
> licenses & notices, and relocate all the common packages. Yes, it's a huge
> amount of work. But I think great open source software should solve those
> abstract common problems, recalling that we were discussing whether we need
> to support multiple versions of the same engine in apache iceberg. I
> remember that Ryan said at the time that if we do not solve this problem in
> the official Apache iceberg repo, it means that every user needs to
> manually solve these multi-version compatibility problems.  It is the
> abstract common problem that I mentioned. This is why I am very pleased to
> devote my bandwidth to multiple-version support, although I initially voted
> in the opposite direction.
>
> Back to this vendor bundle runtime jar issue: it's still the abstract
> common problem. If we don't solve it, everyone who wants to access
> Iceberg tables on Aliyun needs to build their own bundled runtime jar to
> make this work. We may argue that it's the vendor's duty to provide the
> vendor bundle SDK (similar to the AWS bundle SDK), but I don't think
> every vendor who wants to integrate Apache Iceberg has provided one. I
> checked the Aliyun client SDKs: only the Aliyun object storage service
> provides an SDK package [1], but it's a zip package with all the
> individual dependencies in it, which means we still need to load the
> individual dependencies one by one for Flink/Hive. This will make it
> costly for users to access Iceberg tables, and may even eventually cause
> users to give up on Iceberg.
>
> As for the legal or license issues, I checked all the transitive
> dependencies of iceberg-aliyun [2]; all of them are Apache-license
> friendly and are allowed to be redistributed. To my understanding, this
> should not be a problem. Besides, the Apache Hadoop release has already
> included the Aliyun OSS SDK, which I think provides a precedent.
>
> [1]. https://www.alibabacloud.com/help/en/doc-detail/32009.html
> [2]. https://github.com/apache/iceberg/pull/3684
>
>
> On Thu, Dec 9, 2021 at 12:31 AM Ryan Blue  wrote:
>
>> The main problem with creating runtime Jars is transitive dependencies.
>> Getting a minimal set of transitive dependencies, relocating the classes
>> that they pull in to avoid conflicts, and tracking licensing is a huge
>> amount of work that has so far been done or validated by a very small set
>> of people.
>>
>> In addition, it is easy to make mistakes here. Updating a dependency can
>> inadvertently pull in extra transitive dependencies that have incompatible
>> licenses, aren't relocated, or otherwise cause significant license or
>> runtime problems.
>>
>> We currently support runtime Jars for engines because it would otherwise
>> be very difficult for people to use Iceberg. I don't think that same logic
>> applies to vendor bundles. So the main question is: why are we doing this
>> in Iceberg? Couldn't this integration be provided as a third-party Jar? The
>> FileIO API is quite stable. And while I think it makes sense to have the
>> implementations in Iceberg for maintenance, I don't think that it makes
>> sense to provide a runtime Jar.
>>
>> I could be convinced otherwise, but I'm skeptical.
>>
>> Ryan
>>
>> On Tue, Dec 7, 2021 at 7:52 PM Jack Ye  wrote:
>>
>>> Hi everyone,
>>>
>>> As we are adding Aliyun as

Re: Vendor integration strategy

2021-12-09 Thread OpenInx
>> usually maintain their own version of the AWS SDK and would like to upgrade
>> it independently of the AWS SDK version used by Iceberg. Although currently
>> it takes more effort for users to specify all the compile-only
>> dependencies, compute vendor services like AWS EMR are going to offer all
>> the jars directly in the classpath to avoid such need in the very near
>> future, and EMR will maintain their AWS SDK version upgrade independently.
>>
>> But the approach proposed by Aliyun seems to fit the use case of Aliyun
>> users better. For more context, please read
>> https://github.com/apache/iceberg/pull/3270 for the discussion between
>> me and Openinx and https://github.com/apache/iceberg/pull/3684 for the
>> approach proposed.
>>
>> I think we should consolidate the vendor integration strategy going
>> forward. It could be we support both approaches, or just choose one
>> approach going forward. It would be great if people with similar experience
>> or need could provide some insights.
>>
>> Best,
>> Jack Ye
>>
>>
>>
>
> --
> Ryan Blue
> Tabular
>


Re: Welcome new PMC members!

2021-11-17 Thread OpenInx
Congrats,  Jack and Russell !  Well deserved !

On Thu, Nov 18, 2021 at 9:08 AM karuppayya  wrote:

> Congratulations Russell and Jack!!
>
> - Karuppayya
>
> On Wed, Nov 17, 2021 at 5:02 PM Yufei Gu  wrote:
>
>> Congratulations, Jack and Russell!
>>
>> Best,
>>
>> Yufei
>>
>> `This is not a contribution`
>>
>>
>> On Wed, Nov 17, 2021 at 4:19 PM Neelesh Salian 
>> wrote:
>>
>>> Congratulations Jack and Russell. Well deserved.
>>>
>>>
>>> On Wed, Nov 17, 2021 at 4:13 PM Kyle Bendickson  wrote:
>>>
 Congratulations to both Jack and Russell!

Very well deserved indeed :)

 On Wed, Nov 17, 2021 at 4:12 PM Ryan Blue  wrote:

> Hi everyone, I want to welcome Jack Ye and Russell Spitzer to the
> Iceberg PMC. They've both been amazing at reviewing and helping people in
> the community and the PMC has decided to invite them to join.
> Congratulations, Jack and Russell! Thank you for all your hard work and
> support for the project.
>
> Ryan
>
> --
> Ryan Blue
>

>>>
>>> --
>>> Regards,
>>> Neelesh S. Salian
>>>
>>>


Re: Upcoming Iceberg Community Sync (11/17 9:00am PT)

2021-11-16 Thread OpenInx
The correct PR link contributed by Reo-LEI is:
https://github.com/apache/iceberg/pull/3240, so let's update the following
part:

  b.   v2's extra meta columns messed up the flink's RowData pos:
 * https://github.com/apache/iceberg/pull/3240 (Thanks Reo-LEI )
 * Another related PR to enhance the unit tests is:
https://github.com/apache/iceberg/pull/3477 (Need someone to review & merge
this).


On Wed, Nov 17, 2021 at 10:03 AM OpenInx  wrote:

> Let me give more inputs from my perspective.
>
> 1.  Fixed a few critical Flink v2 reader bugs:
>   a.  The flink avro reader bug:
> https://github.com/apache/iceberg/pull/3540
>   b.   v2's extra meta columns messed up the flink's RowData pos:
>  * https://github.com/apache/iceberg/pull/3540 (Thanks
> Reo-LEI )
>  * Another related PR to enhance the unit tests is:
> https://github.com/apache/iceberg/pull/3477 (Need someone to review &
> merge this).
>
> 2.  Split the whole FLIP-27 source reader into small PRs for review
> (thanks Stevenzwu).
>   a.  The first small PR is:
> https://github.com/apache/iceberg/pull/3501
>
> 3.  Support the latest flink 1.14.0:
> https://github.com/apache/iceberg/pull/3434 (Need an apache iceberg
> committer or PMC to review).
>
> 4.  The Aliyun OSS integration work is almost complete:
> https://github.com/apache/iceberg/projects/21. I think it's okay to
> release it in Iceberg 0.13.0; there are only a few small issues left to
> merge before the Aliyun OSS storage works.
>
>
> On Tue, Nov 16, 2021 at 9:25 PM Sam Redai  wrote:
>
>> Hey everyone,
>>
>> This is just a friendly reminder that the next Iceberg Community Sync is
>> tomorrow (11/17) at 9:00AM PT. The meeting will be recorded and shared
>> shortly after on the dev mailing list in case you can’t make it.
>>
>> Friendly reminder to add any highlights, agenda, or discussion items to
>> the attached meeting doc!
>>
>>  Iceberg community syncs
>> <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web>
>>
>> Best,
>> Sam
>>
>


Re: Upcoming Iceberg Community Sync (11/17 9:00am PT)

2021-11-16 Thread OpenInx
Let me give more inputs from my perspective.

1.  Fixed a few critical Flink v2 reader bugs:
  a.  The flink avro reader bug:
https://github.com/apache/iceberg/pull/3540
  b.   v2's extra meta columns messed up the flink's RowData pos:
 * https://github.com/apache/iceberg/pull/3540 (Thanks Reo-LEI )
 * Another related PR to enhance the unit tests is:
https://github.com/apache/iceberg/pull/3477 (Need someone to review & merge
this).

2.  Split the whole FLIP-27 source reader into small PRs for review
(thanks Stevenzwu).
  a.  The first small PR is: https://github.com/apache/iceberg/pull/3501

3.  Support the latest flink 1.14.0:
https://github.com/apache/iceberg/pull/3434 (Need an apache iceberg
committer or PMC to review).

4.  The Aliyun OSS integration work is almost complete:
https://github.com/apache/iceberg/projects/21. I think it's okay to
release it in Iceberg 0.13.0; there are only a few small issues left to
merge before the Aliyun OSS storage works.


On Tue, Nov 16, 2021 at 9:25 PM Sam Redai  wrote:

> Hey everyone,
>
> This is just a friendly reminder that the next Iceberg Community Sync is
> tomorrow (11/17) at 9:00AM PT. The meeting will be recorded and shared
> shortly after on the dev mailing list in case you can’t make it.
>
> Friendly reminder to add any highlights, agenda, or discussion items to
> the attached meeting doc!
>
>  Iceberg community syncs
> 
>
> Best,
> Sam
>


Re: [DISCUSS] Iceberg roadmap

2021-11-07 Thread OpenInx
Any thoughts on adding the StarRocks integration to the roadmap?

I think the guys from StarRocks community can provide more background and
inputs.

On Thu, Nov 4, 2021 at 5:59 PM OpenInx  wrote:

> Update:
>
> StarRocks[1] is a next-gen sub-second MPP database for full analysis
> scenarios, including multi-dimensional analytics, real-time analytics and
> ad-hoc query.  Their team is planning to integrate iceberg tables as
> StarRocks external tables in the next month [2], so that people could
> connect the data lake and StarRocks warehouse in the same engine.
> The excellent performance of StarRocks will also help accelerate analysis
> of and access to Iceberg tables; I think this is a great thing for both
> the Iceberg and StarRocks communities. Can we add an extra project for
> the StarRocks integration work to the Apache Iceberg roadmap [3]?
>
> [1].  https://github.com/StarRocks/starrocks
> [2].  https://github.com/StarRocks/starrocks/issues/1030
> [3].  https://github.com/apache/iceberg/projects
>
> On Mon, Nov 1, 2021 at 11:52 PM Ryan Blue  wrote:
>
>> I closed the upgrade project and marked the FLIP-27 project priority 1.
>> Thanks for all the work to get this done!
>>
>> On Sun, Oct 31, 2021 at 8:10 PM OpenInx  wrote:
>>
>>> Update:
>>>
>>> I think the project  [Flink: Upgrade to 1.13.2][1] in RoadMap can be
>>> closed now, because all of the issues have been addressed.
>>>
>>> [1]. https://github.com/apache/iceberg/projects/12
>>>
>>> On Tue, Sep 21, 2021 at 6:17 PM Eduard Tudenhoefner 
>>> wrote:
>>>
>>>> I created a Roadmap section in
>>>>  https://github.com/apache/iceberg/pull/3163
>>>> <https://github.com/apache/iceberg/pull/3163> that links to the
>>>> planning boards that Jack created. I figured it makes sense if we link
>>>> available Design Docs directly on those Boards (as was already done),
>>>> because then the Design docs are closer to the set of related issues.
>>>>
>>>> On Mon, Sep 20, 2021 at 10:02 PM Ryan Blue  wrote:
>>>>
>>>>> Thanks, Jack!
>>>>>
>>>>> Eduard, I think that's a good idea. We should have a roadmap page as
>>>>> well that links to the projects that Jack just created.
>>>>>
>>>>> On Mon, Sep 20, 2021 at 12:57 PM Jack Ye  wrote:
>>>>>
>>>>>> It seems like we have reached some consensus around the projects
>>>>>> listed here. I have created corresponding Github projects for each:
>>>>>> https://github.com/apache/iceberg/projects
>>>>>>
>>>>>> Related design docs are also linked there.
>>>>>>
>>>>>> Best,
>>>>>> Jack Ye
>>>>>>
>>>>>> On Sun, Sep 19, 2021 at 11:18 PM Eduard Tudenhoefner <
>>>>>> edu...@dremio.com> wrote:
>>>>>>
>>>>>>> Would it make sense to have a section on the website where we
>>>>>>> collect all the links to the design docs/specs as that would be easier 
>>>>>>> to
>>>>>>> find than searching for things on the ML?
>>>>>>>
>>>>>>> I was thinking about something like for each component:
>>>>>>> * link to the ML discussion
>>>>>>> * link to the actual Spec/Design Doc
>>>>>>>
>>>>>>> Thoughts?
>>>>>>>
>>>>>>> On Fri, Sep 10, 2021 at 11:38 PM Ryan Blue  wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> At the last sync meeting, we brought up publishing a community
>>>>>>>> roadmap and brainstormed the many features and initiatives that the
>>>>>>>> community is working on. In this thread, I want to make sure that we 
>>>>>>>> have a
>>>>>>>> good list of what people are thinking about and I think we should try 
>>>>>>>> to
>>>>>>>> categorize the projects by size and general priority. When we reach a 
>>>>>>>> rough
>>>>>>>> agreement, I’ll write this up and post it on the ASF site along with 
>>>>>>>> links
>>>>>>>> to some projects in Github.
>>>>>>>>
>>>>>>>> My rationale for attempting to 

Re: [VOTE] Release Apache Iceberg 0.12.1 RC0

2021-11-05 Thread OpenInx
+1  (binding)

1. Download the source tarball, signature (.asc), and checksum (.sha512):
 OK
2. Import gpg keys: download KEYS and run gpg --import
/path/to/downloaded/KEYS (optional if this hasn’t changed) :  OK
3. Verify the signature by running: gpg --verify
apache-iceberg-xx-incubating.tar.gz.asc:  OK
4. Verify the checksum by running: shasum -a 256 -c
apache-iceberg-0.12.1.tar.gz.sha512 apache-iceberg-0.12.1.tar.gz :  OK
5. Untar the archive and go into the source directory: tar xzf
apache-iceberg-xx-incubating.tar.gz && cd apache-iceberg-xx-incubating:  OK
6. Run RAT checks to validate license headers: dev/check-license: OK
7. Build and test the project: ./gradlew build (use Java 8) :   OK
8. Check that Flink works fine via the following command line:

./bin/sql-client.sh embedded -j
/Users/openinx/Downloads/apache-iceberg-0.12.1/flink-runtime/build/libs/iceberg-flink-runtime-0.12.1.jar
shell

CREATE CATALOG hadoop_prod WITH (
'type'='iceberg',
'catalog-type'='hadoop',
'warehouse'='file:///Users/openinx/test/iceberg-warehouse'
);

CREATE TABLE `hadoop_prod`.`default`.`flink_table` (
id BIGINT,
data STRING
);

INSERT INTO `hadoop_prod`.`default`.`flink_table` VALUES (1, 'AAA');
SELECT * FROM `hadoop_prod`.`default`.`flink_table`;
+----+------+
| id | data |
+----+------+
|  1 | AAA  |
+----+------+
1 row in set

Thanks all for the work.

On Fri, Nov 5, 2021 at 2:20 PM Cheng Pan  wrote:

> +1 (non-binding)
>
> The integration test based on the master branch of Apache Kyuubi
> (Incubating) passed.
>
> https://github.com/apache/incubator-kyuubi/pull/1338
>
> Thanks,
> Cheng Pan
>
> On Fri, Nov 5, 2021 at 1:19 PM Kyle Bendickson  wrote:
> >
> >
> > +1 (binding)
> >
> > - Validated checksums, signatures, and licenses
> > -  Ran all of the unit tests
> > - Imported Files from Orc tables via Spark stored procedure, with
> floating point type columns and inspected the metrics afterwards
> > - Registered and used bucketed UDFs for various types such as integer
> and byte
> > - Created and dropped tables
> > - Ran MERGE INTO queries using Spark DDL
> > - Verified ability to read tables with parquet files with nested map
> type schema from various versions (both before and after Parquet 1.11.0 ->
> 1.11.1 upgrade)
> > - Tried to set a tblproperty to null (received error as expected)
> > - Full unit test suite
> > - Ran several Flink queries, both batch and streaming.
> > - Tested against a custom catalog
> >
> > My spark configuration was very similar to Ryan’s. I used Flink 1.12.1
> on a docker-compose setup via the Flink SQL client with 2 task managers.
> >
> > In addition to testing with a custom catalog, I also tested with HMS /
> Hive catalog with HDFS as storage as well as Hadoop Catalog with data on
> (local) HDFS.
> >
> > I’ve not gotten the Hive3 errors despite running unit tests several
> times.
> >
> > - Kyle (@kbendick)
> >
> >
> > On Thu, Nov 4, 2021 at 9:57 PM Daniel Weeks  wrote:
> >>
> >> +1 (binding)
> >>
> >> Verified sigs, sums, license, build and test.
> >>
> >> -Dan
> >>
> >> On Thu, Nov 4, 2021 at 4:30 PM Ryan Blue  wrote:
> >>>
> >>> +1 (binding)
> >>>
> >>> Validated checksums, checked signature, ran tests (still a couple
> failing in Hive3)
> >>> Staged binaries from the release tarball
> >>> Tested Spark metadata tables
> >>> Used rewrite_manifests stored procedure in Spark
> >>> Updated to v2 using SET TBLPROPERTIES
> >>> Dropped and added partition fields
> >>> Replaced a table with itself using INSERT OVERWRITE
> >>> Tested custom catalogs
> >>>
> >>> Here’s my Spark config script in case anyone else wants to validate:
> >>>
> >>> /home/blue/Apps/spark-3.1.1-bin-hadoop3.2/bin/spark-shell \
> >>> --conf spark.jars.repositories=
> https://repository.apache.org/content/repositories/orgapacheiceberg-1019/
> \
> >>> --packages org.apache.iceberg:iceberg-spark3-runtime:0.12.1 \
> >>> --conf
> spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
> \
> >>> --conf
> spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
> >>> --conf spark.sql.catalog.local.type=hadoop \
> >>> --conf
> spark.sql.catalog.local.warehouse=/home/blue/tmp/hadoop-warehouse \
> >>> --conf spark.sql.catalog.local.default-namespace=default \
> >>> --conf
> spark.sql.catalog.prodhive=org.apache.iceberg.spark.SparkCatalog \
> >>> --conf spark.

Re: [DISCUSS] Iceberg roadmap

2021-11-04 Thread OpenInx
Update:

StarRocks[1] is a next-gen sub-second MPP database for full analysis
scenarios, including multi-dimensional analytics, real-time analytics and
ad-hoc query.  Their team is planning to integrate iceberg tables as
StarRocks external tables in the next month [2], so that people could
connect the data lake and StarRocks warehouse in the same engine.
The excellent performance of StarRocks will also help accelerate analysis
of and access to Iceberg tables; I think this is a great thing for both
the Iceberg and StarRocks communities. Can we add an extra project for the
StarRocks integration work to the Apache Iceberg roadmap [3]?

[1].  https://github.com/StarRocks/starrocks
[2].  https://github.com/StarRocks/starrocks/issues/1030
[3].  https://github.com/apache/iceberg/projects

On Mon, Nov 1, 2021 at 11:52 PM Ryan Blue  wrote:

> I closed the upgrade project and marked the FLIP-27 project priority 1.
> Thanks for all the work to get this done!
>
> On Sun, Oct 31, 2021 at 8:10 PM OpenInx  wrote:
>
>> Update:
>>
>> I think the project  [Flink: Upgrade to 1.13.2][1] in RoadMap can be
>> closed now, because all of the issues have been addressed.
>>
>> [1]. https://github.com/apache/iceberg/projects/12
>>
>> On Tue, Sep 21, 2021 at 6:17 PM Eduard Tudenhoefner 
>> wrote:
>>
>>> I created a Roadmap section in
>>>  https://github.com/apache/iceberg/pull/3163
>>> <https://github.com/apache/iceberg/pull/3163> that links to the
>>> planning boards that Jack created. I figured it makes sense if we link
>>> available Design Docs directly on those Boards (as was already done),
>>> because then the Design docs are closer to the set of related issues.
>>>
>>> On Mon, Sep 20, 2021 at 10:02 PM Ryan Blue  wrote:
>>>
>>>> Thanks, Jack!
>>>>
>>>> Eduard, I think that's a good idea. We should have a roadmap page as
>>>> well that links to the projects that Jack just created.
>>>>
>>>> On Mon, Sep 20, 2021 at 12:57 PM Jack Ye  wrote:
>>>>
>>>>> It seems like we have reached some consensus around the projects
>>>>> listed here. I have created corresponding Github projects for each:
>>>>> https://github.com/apache/iceberg/projects
>>>>>
>>>>> Related design docs are also linked there.
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>> On Sun, Sep 19, 2021 at 11:18 PM Eduard Tudenhoefner <
>>>>> edu...@dremio.com> wrote:
>>>>>
>>>>>> Would it make sense to have a section on the website where we collect
>>>>>> all the links to the design docs/specs as that would be easier to find 
>>>>>> than
>>>>>> searching for things on the ML?
>>>>>>
>>>>>> I was thinking about something like for each component:
>>>>>> * link to the ML discussion
>>>>>> * link to the actual Spec/Design Doc
>>>>>>
>>>>>> Thoughts?
>>>>>>
>>>>>> On Fri, Sep 10, 2021 at 11:38 PM Ryan Blue  wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> At the last sync meeting, we brought up publishing a community
>>>>>>> roadmap and brainstormed the many features and initiatives that the
>>>>>>> community is working on. In this thread, I want to make sure that we 
>>>>>>> have a
>>>>>>> good list of what people are thinking about and I think we should try to
>>>>>>> categorize the projects by size and general priority. When we reach a 
>>>>>>> rough
>>>>>>> agreement, I’ll write this up and post it on the ASF site along with 
>>>>>>> links
>>>>>>> to some projects in Github.
>>>>>>>
>>>>>>> My rationale for attempting to prioritize projects is that if we try
>>>>>>> to do too many things, it will be slower progress across everything 
>>>>>>> rather
>>>>>>> than getting a few important items done. I know that priorities don’t 
>>>>>>> align
>>>>>>> very cleanly in practice, but it is hopefully worth trying. To come up 
>>>>>>> with
>>>>>>> a priority, I’m trying to keep top priority items to a minimum by 
>>>>>>> including
>>>

Re: [DISCUSS] Iceberg roadmap

2021-10-31 Thread OpenInx
Update:

I think the project  [Flink: Upgrade to 1.13.2][1] in RoadMap can be closed
now, because all of the issues have been addressed.

[1]. https://github.com/apache/iceberg/projects/12

On Tue, Sep 21, 2021 at 6:17 PM Eduard Tudenhoefner 
wrote:

> I created a Roadmap section in https://github.com/apache/iceberg/pull/3163
>  that links to the planning
> boards that Jack created. I figured it makes sense if we link available
> Design Docs directly on those Boards (as was already done), because then
> the Design docs are closer to the set of related issues.
>
> On Mon, Sep 20, 2021 at 10:02 PM Ryan Blue  wrote:
>
>> Thanks, Jack!
>>
>> Eduard, I think that's a good idea. We should have a roadmap page as well
>> that links to the projects that Jack just created.
>>
>> On Mon, Sep 20, 2021 at 12:57 PM Jack Ye  wrote:
>>
>>> It seems like we have reached some consensus around the projects listed
>>> here. I have created corresponding Github projects for each:
>>> https://github.com/apache/iceberg/projects
>>>
>>> Related design docs are also linked there.
>>>
>>> Best,
>>> Jack Ye
>>>
>>> On Sun, Sep 19, 2021 at 11:18 PM Eduard Tudenhoefner 
>>> wrote:
>>>
 Would it make sense to have a section on the website where we collect
 all the links to the design docs/specs as that would be easier to find than
 searching for things on the ML?

 I was thinking about something like for each component:
 * link to the ML discussion
 * link to the actual Spec/Design Doc

 Thoughts?

 On Fri, Sep 10, 2021 at 11:38 PM Ryan Blue  wrote:

> Hi everyone,
>
> At the last sync meeting, we brought up publishing a community roadmap
> and brainstormed the many features and initiatives that the community is
> working on. In this thread, I want to make sure that we have a good list 
> of
> what people are thinking about and I think we should try to categorize the
> projects by size and general priority. When we reach a rough agreement,
> I’ll write this up and post it on the ASF site along with links to some
> projects in Github.
>
> My rationale for attempting to prioritize projects is that if we try
> to do too many things, it will be slower progress across everything rather
> than getting a few important items done. I know that priorities don’t 
> align
> very cleanly in practice, but it is hopefully worth trying. To come up 
> with
> a priority, I’m trying to keep top priority items to a minimum by 
> including
> only one from each group (Spark, Flink, Python, etc.). The remaining items
> are split between priority 2 and 3. Priority 3 is not urgent, including
> things that can be plugged in (like other IO libraries), docs, etc.
> Everything else is priority 2.
>
> That something isn’t priority 1 doesn’t mean it isn’t important or
> progressing, just that it isn’t the current focus. I think of it this way:
> if someone has extra time to review something, what should be next? That’s
> top priority.
>
> Here’s my rough categorization. If you disagree, please speak up:
>
>- If you think that something should be top priority, what gets
>moved to priority 2?
>- Should the priority for a project in 2 or 3 change?
>- Is the S/M/L size of a project wrong?
>
> Top priority, 1:
>
>- API: Iceberg 1.0 [medium]
>- Spark: Merge-on-read plans [large]
>- Maintenance: Delete file compaction [medium]
>- Flink: Upgrade to 1.13.2 (document compatibility) [medium]
>- Python: Pythonic refactor [medium]
>
> Priority 2:
>
>- ORC: Support delete files stored as ORC [small]
>- Spark: DSv2 streaming improvements [small]
>- Flink: Inline file compaction [small]
>- Flink: Support UPSERT [small]
>- Views: Spec [medium]
>- Spec: Z-ordering / Space-filling curves [medium]
>- Spec: Snapshot tagging and branching [small]
>- Spec: Secondary indexes [large]
>- Spec v3: Encryption [large]
>- Spec v3: Relative paths [large]
>- Spec v3: Default field values [medium]
>
> Priority 3:
>
>- Docs: versioned docs [medium]
>- IO: Support Aliyun OSS/DLF [medium]
>- IO: Support Dell ECS [medium]
>
> External:
>
>- Trino: Bucketed joins [small]
>- Trino: Row-level delete support [medium]
>- Trino: Merge-on-read plans [medium]
>- Trino: Multi-catalog support [small]
>
> --
> Ryan Blue
> Tabular
>

>>
>> --
>> Ryan Blue
>> Tabular
>>
>


Re: Iceberg 0.12.1 Patch Release - Call for Bug Fixes and Patches

2021-10-27 Thread OpenInx
I think it's good to invite the Apache Iceberg Hive maintainers to evaluate
whether this issue should be a blocker for 0.12.1. I'm okay either way.

On Wed, Oct 27, 2021 at 11:50 PM Ryan Blue  wrote:

> I'm not sure that #3393 is necessarily something that we should wait for.
> If it gets in soon, I'd be all for including it. But there are some
> important correctness fixes going into 0.12.1 for delete file commits and
> I'd like to get those out as soon as possible.
>
> It looks like this bug affects Hive and is a failure, not a correctness
> problem. I would probably opt to continue with 0.12.1 and follow up with
> 0.12.2 once this is fixed if we think that it is affecting enough people
> that a patch release is warranted. And if we don't think that a patch
> release for this is needed, then I think that makes it less important to
> get it into 0.12.1.
>
> What does everyone else think? Should we wait for this Hive fix?
>
> On Wed, Oct 27, 2021 at 3:17 AM OpenInx  wrote:
>
>> I think we will need to fix this critical iceberg bug before we release
>> the 0.12.1: https://github.com/apache/iceberg/issues/3393 . Let's mark
>> it as a blocker for the 0.12.1.
>>
>> On Fri, Oct 22, 2021 at 3:22 AM Kyle Bendickson  wrote:
>>
>>> Thank you everybody for the additional PRs brought up so far.
>>>
>>> I’ve volunteered to be release manager, so will be doing my best to go
>>> through and ensure these are prioritized for consideration (if some are
>>> truly new features they might need to wait for 0.13.0, but as I'm just
>>> the release manager, that will be more up to the community).
>>>
>>> If any committers or contributors have free cycles and are willing to
>>> review some of these PRs, that would be greatly appreciated!
>>>
>>> - Kyle Bendickson [@kbendick]
>>>
>>> On Thu, Oct 21, 2021 at 11:19 AM Peter Vary 
>>> wrote:
>>>
>>>> Just to make this clear: https://github.com/apache/iceberg/pull/3338
>>>> fixes the issue caused by https://github.com/apache/iceberg/pull/2565.
>>>> The fix will make Catalogs.loadCatalog consistent with
>>>> Catalogs.hiveCatalog, and fixes the create table issues when no catalog
>>>> is set in the config.
>>>>
>>>> On 2021. Oct 21., at 16:59, Peter Vary  wrote:
>>>>
>>>> I would like to have this in 0.12.1:
>>>> https://github.com/apache/iceberg/pull/3338
>>>>
>>>> This breaks Hive queries if no catalog is set, but the fix still needs
>>>> to be reviewed before merging.
>>>>
>>>> Thanks, Peter
>>>>
>>>>
>>>> On Thu, 21 Oct 2021, 07:12 Rajarshi Sarkar, 
>>>> wrote:
>>>>
>>>>> Hope this can get in: https://github.com/apache/iceberg/pull/3175
>>>>>
>>>>> Regards,
>>>>> Rajarshi Sarkar,
>>>>>
>>>>>
>>>>> On Thu, Oct 21, 2021 at 9:08 AM Cheng Pan  wrote:
>>>>>
>>>>>> Hope this can get in.
>>>>>> https://github.com/apache/iceberg/pull/3203
>>>>>>
>>>>>> Thanks,
>>>>>> Cheng Pan
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 21, 2021 at 11:34 AM Reo Lei  wrote:
>>>>>>
>>>>>>> Thanks Kyle for syncing this!
>>>>>>>
>>>>>>> I think PR#3240 should be include in this release. Because in our
>>>>>>> Dingding group, we have received feedback from many flink users that 
>>>>>>> they
>>>>>>> encountered this problem. I think this PR is very important and we need 
>>>>>>> to
>>>>>>> fix this problem ASAP.
>>>>>>>
>>>>>>> link: https://github.com/apache/iceberg/pull/3240
>>>>>>>
>>>>>>> BR,
>>>>>>> Reo LEI
>>>>>>>
>>>>>>>> On Thu, Oct 21, 2021 at 2:52 AM, Kyle Bendickson wrote:
>>>>>>>
>>>>>>>> As mentioned in today's community sync up, we're planning on
>>>>>>>> releasing a new point version of Iceberg - Apache Iceberg 0.12.1.
>>>>>>>>
>>>>>>>> If there are any outstanding bugs you'd like to include fixes for
>>>>>>>> or other minor patches, please respond to this email thread letting us 
>>>>>>>> know.
>>>>>>>>
>>>>>>>> The current list of patches to be included can be found in the
>>>>>>>> milestone on Github:
>>>>>>>> https://github.com/apache/iceberg/milestone/15?closed=1
>>>>>>>>
>>>>>>>> As new items are added, they will be included in the milestone.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Kyle Bendickson [ Github: @kbendick ]
>>>>>>>>
>>>>>>>
>>>>
>
> --
> Ryan Blue
> Tabular
>


Re: Iceberg 0.12.1 Patch Release - Call for Bug Fixes and Patches

2021-10-27 Thread OpenInx
I think we will need to fix this critical iceberg bug before we release the
0.12.1: https://github.com/apache/iceberg/issues/3393 . Let's mark it as a
blocker for the 0.12.1.

On Fri, Oct 22, 2021 at 3:22 AM Kyle Bendickson  wrote:

> Thank you everybody for the additional PRs brought up so far.
>
> I’ve volunteered to be release manager, so will be doing my best to go
> through and ensure these are prioritized for consideration (if some are
> truly new features they might need to wait for 0.13.0, but as I'm just
> the release manager, that will be more up to the community).
>
> If any committers or contributors have free cycles and are willing to
> review some of these PRs, that would be greatly appreciated!
>
> - Kyle Bendickson [@kbendick]
>
> On Thu, Oct 21, 2021 at 11:19 AM Peter Vary 
> wrote:
>
>> Just to make this clear: https://github.com/apache/iceberg/pull/3338
>> fixes the issue caused by https://github.com/apache/iceberg/pull/2565.
>> The fix will make Catalogs.loadCatalog consistent with
>> Catalogs.hiveCatalog, and fixes the create table issues when no catalog
>> is set in the config.
>>
>> On 2021. Oct 21., at 16:59, Peter Vary  wrote:
>>
>> I would like to have this in 0.12.1:
>> https://github.com/apache/iceberg/pull/3338
>>
>> This breaks Hive queries if no catalog is set; the fix still needs to
>> be reviewed before merge.
>>
>> Thanks, Peter
>>
>>
>> On Thu, 21 Oct 2021, 07:12 Rajarshi Sarkar,  wrote:
>>
>>> Hope this can get in: https://github.com/apache/iceberg/pull/3175
>>>
>>> Regards,
>>> Rajarshi Sarkar,
>>>
>>>
>>> On Thu, Oct 21, 2021 at 9:08 AM Cheng Pan  wrote:
>>>
 Hope this can get in.
 https://github.com/apache/iceberg/pull/3203

 Thanks,
 Cheng Pan


 On Thu, Oct 21, 2021 at 11:34 AM Reo Lei  wrote:

> Thanks Kyle for syncing this!
>
> I think PR#3240 should be included in this release, because in our
> DingDing group we have received feedback from many flink users who
> encountered this problem. I think this PR is very important and we need to
> fix this problem ASAP.
>
> link: https://github.com/apache/iceberg/pull/3240
>
> BR,
> Reo LEI
>
> Kyle Bendickson  于2021年10月21日周四 上午2:52写道:
>
>> As mentioned in today's community sync up, we're planning on
>> releasing a new point version of Iceberg - Apache Iceberg 0.12.1.
>>
>> If there are any outstanding bugs you'd like to include fixes for or
>> other minor patches, please respond to this email thread letting us know.
>>
>> The current list of patches to be included can be found in the
>> milestone on Github:
>> https://github.com/apache/iceberg/milestone/15?closed=1
>>
>> As new items are added, they will be included in the milestone.
>>
>> Best,
>> Kyle Bendickson [ Github: @kbendick ]
>>
>
>>


Re: Meeting Minutes from 10/20 Iceberg Sync

2021-10-22 Thread OpenInx
Thanks for the detailed report !

One more thing: we have now made a lot of progress on integrating Alibaba
Cloud (https://www.aliyun.com/). Please see
https://github.com/apache/iceberg/projects/21 (thanks @xingbowu -
https://github.com/xingbowu).

On Thu, Oct 21, 2021 at 11:30 PM Sam Redai  wrote:

> Good Morning Everyone,
>
> Here are the minutes from our Iceberg Sync that took place on October
> 20th, 9am-10am PT. Please remember that anyone can join the discussion so
> feel free to share the Iceberg-Sync google group with anyone who
> is seeking an invite. As usual, the notes and the agenda are posted in the
> live doc that's also attached to the meeting invitation.
>
> We covered a lot of topics...here we go!:
>
> Top of the Meeting Highlights
>
>    - Sort based compaction - This is finished, reviewed, and merged. When
>    you compact data files, you can now also have Spark re-sort them, either by
>    the table's sort order or the sort order given when you create the
>    compaction job.
>    - Spark build refactor: Thank you to Jack for getting us started on the
>    Spark build refactor and also thanks to Anton for reviewing and helping get
>    these changes in. We've gone with a variant of option 3 from our last
>    discussions where we include all of the spark modules in our build but make
>    it easy to turn them off. This way we can get the CI to run Spark, Hive,
>    and Flink tests separately and only if necessary.
>    - Delete files implementation for ORC: Thanks to Peter for adding
>    builders to store deletes in ORC (previously we could only store deletes in
>    Parquet or Avro). This means we now have support for all 3 formats for this
>    feature.
>    - Flink Update: We've updated Flink to 1.13 so we're back on a supported
>    version. 1.14 is out this week so we can aim to move to that at some point.
>
> Iceberg 0.12.1 Upcoming Patch Release (milestone)
>
>    - Fix for the parquet map projection bug
>    - Fix Flink CDC bug
>    - A few other fixes that we also want to get out to the community so
>    we're going to start a release candidate as soon as possible
>    - Kyle will start a thread in the general slack channel so everyone
>    please feel free to mention any additional fixes that they want to see in
>    this patch release
>
> Snapshot Releases
>
>    - Eduard will tackle adding snapshot releases
>    - In our deploy.gradle file, it's set up to deploy to the snapshot
>    repository
>    - May require certain credentials so it may be required to reach out to
>    the ASF infrastructure team
>
> Iceberg 0.13.0 Upcoming Release
>
>    - There's agreement to switch to a time based release schedule so the
>    next release is roughly mid-November
>    - Jack will cut a branch close to that time and any features that aren't
>    in yet will be pushed to the next release
>    - We agree not to hold up releases to squeeze features in and prefer
>    instead to aim for releasing sooner the next time
>
> Adding v3.2 to Spark Build Refactoring
>
>    - Russell and Anton will coordinate on dropping in a Spark 3.2 module
>    - We currently have 3.1 in the `spark3` module. We'll move that out to
>    its own module and mirror what we do with the 3.2 module. (This will enable
>    cleaning up some mixed 3.0/3.1 code)
>
> Merge on Read
>
>    - Anton has a bunch of PRs ready to queue up to contribute their
>    internal implementation. (Russell will work with him)
>    - This feature will allow for a much lower write amplification
>    - The expectation is that in Spark 3.3 we can rely on Spark's internal
>    merge on read
>
> Snapshot Tagging (design doc) (PR #3104)
>
>    - We just had a meeting on Monday about that and made some conclusions
>    and designs, so anyone who is interested please take a look.
>    - Next steps are to add the feature in the stack and Jack already has a
>    WIP implementation into the table metadata class
>
> Delete Compaction (design doc)
>
>    - Discussion happening at 5pm ET on 10/21 5-6pm PT for anyone interested
>    (meeting link)
>    - Some more discussion is needed to hone in on a final design choice.
>    There are a few options that each have their own pros and cons.
>
> The New Source Interface for Flink (FLIP-27)

Re: Snapshot tagging, branching and retention

2021-10-13 Thread OpenInx
Is it possible to keep meeting notes for this and publish them to the
mailing list? I don't think everybody can attend this meeting.

Thanks.

On Thu, Oct 14, 2021 at 2:00 AM Jack Ye  wrote:

> Hi everyone,
>
> Based on some offline discussions with different people around
> availability, we will hold the meeting on Monday 10/18 9am PDT.
>
> Here is the meeting link: meet.google.com/ubj-kvfm-ehg
>
> I have added all the people in this thread to the invite. Feel free to
> also forward the meeting to anyone else interested.
>
> Best,
> Jack Ye
>
> On Mon, Oct 11, 2021 at 8:53 AM Eduard Tudenhoefner 
> wrote:
>
>> Hey Jack,
>>
>> would this week on Wednesday work for you from 9 to 10am PDT?
>>
>> On Thu, Oct 7, 2021 at 7:41 PM Jack Ye  wrote:
>>
>>> Hi everyone,
>>>
>>> We have had a few iterations of the design doc with various people,
>>> thanks for all the feedback. I am thinking about a meeting to finalize the
>>> design and move forward with implementation.
>>>
>>> Considering the various time zones, I propose we choose any time from
>>> Tuesday (10/12) to Friday (10/15), 8-10am PDT, 1 hour meeting slot.
>>>
>>> If anyone is interested in joining, please let me know the preferred
>>> time slot.
>>>
>>> Best,
>>> Jack Ye
>>>
>>>
>>> On Wed, Sep 15, 2021 at 11:29 PM Eduard Tudenhoefner 
>>> wrote:
>>>
 Nice work Jack, the proposal looks really good.

 On Sun, Aug 29, 2021 at 9:20 AM Jack Ye  wrote:

> Hi everyone,
>
> Recently I have published PR 2961 - add snapshot tags interface (
> https://github.com/apache/iceberg/pull/2961) and received a lot of
> great feedback. I have summarized everything in the discussions and put up
> a design to discuss the path forward around snapshot tagging, branching 
> and
> retention:
>
>
> https://docs.google.com/document/d/1PvxK_0ebEoX3s7nS6-LOJJZdBYr_olTWH9oepNUfJ-A/edit?usp=sharing
>
> Any feedback around the doc would be much appreciated!
>
> Also, to facilitate future changes in Iceberg spec, it would be very
> helpful to take a look at 2597 - Core: introduce TableMetadataBuilder (
> https://github.com/apache/iceberg/pull/2957) which would make
> changing TableMetadata much simpler.
>
> Thanks,
> Jack Ye
>



Re: Iceberg sync times

2021-10-09 Thread OpenInx
Thanks Ryan for bringing this up !  I attended several Iceberg syncs at 5
PM pacific time (9AM CST) and attended only one Iceberg sync at 9AM pacific
time (1 AM CST), and have the following feelings:

1.  We usually arrive at the office around 9:30 AM to 10 AM CST (5:30 PM ~
6:00 PM pacific time).  So when I plan to attend the Iceberg sync, I usually
arrive at the office an hour early (about 8:30 AM CST, which is 4:30 PM
pacific time, I think).  It's not a big deal for us; we're all happy with
it.  (I've been absent from the last few iceberg syncs because of some
personal things, otherwise I could have provided more input from Flink and
the Chinese Iceberg community; I'm sorry for that.)  I think I will keep
attending the following iceberg syncs if we still plan to schedule them at
5 PM pacific time.
2.  As for the Iceberg sync at 9 AM pacific time (1 AM CST), I remember
intentionally setting a reminder alarm on my phone before going to bed
early, then getting up to attend the sync when the alarm rang.  That was
quite an exhausting experience for me, so since then I have turned to the
sync notes and the mailing list to learn what happened in each sync.

From my personal perspective, if the community thinks scheduling the
iceberg sync at 9 AM pacific time can attract more people to attend the
meetings and lead to better community discussion, I'm okay with it
(although I don't think I can attend) as long as we provide detailed sync
notes or a screen recording to give the context.  I will comment on the
sync notes so that others in the community know the latest progress,
requirements, and thoughts from our side.

BTW, the Chinese community has a big DingDing (https://m.dingtalk.com/)
group (1300+ people) to discuss iceberg issues and questions. If others are
interested in what Chinese users are thinking about or what issues they are
encountering, you are welcome to join the DingDing group!

Its group number is 31584735. You can also scan the following QR code
to join the group:

[image: image.png]



On Sat, Oct 9, 2021 at 4:18 AM Ryan Blue  wrote:

> Hi everyone,
>
> Up to now, we've been scheduling every other Iceberg sync at 5 PM pacific
> time so that people in UTC+8 or similar time zones can
> participate. But I've noticed that we hardly ever have people from those
> time zones joining, and quite a bit fewer people from pacific time join. On
> the other hand, we have lots of people joining the meetings at 9 AM pacific
> time.
>
> I think that given the current participation, we should stop holding every
> other sync at 5 pm and schedule all of them at 9 am to reach the most
> people. Does this sound reasonable?
>
> I know that there are probably quite a few people in China interested in a
> sync in CST / UTC+8, so I think we should also explore ways to make those
> easier for people to attend. Would another time or service work better?
> What about having a local sync set up and run by the part of the community
> in China?
>
> Ryan
>
> --
> Ryan Blue
> Tabular
>


Re: [DISCUSS] Spark version support strategy

2021-10-07 Thread OpenInx
> We should probably add a section to our Flink docs that explains and
links to Flink’s support policy and has a table of Iceberg versions that
work with Flink versions. (We should probably have the same table for
Spark, too!)

Thanks Ryan for the suggestion, I created a separate issue to address this
thing before: https://github.com/apache/iceberg/issues/3115.  I will move
this forward.

On Thu, Oct 7, 2021 at 1:55 PM Jack Ye  wrote:

> Hi everyone,
>
> I tried to prototype option 3, here is the PR:
> https://github.com/apache/iceberg/pull/3237
>
> Sorry I did not see that Anton is planning to do it, but anyway it's just
> a draft, so feel free to just use it as reference.
>
> Best,
> Jack Ye
>
> On Sun, Oct 3, 2021 at 2:19 PM Ryan Blue  wrote:
>
>> Thanks for the context on the Flink side! I think it sounds reasonable to
>> keep up to date with the latest supported Flink version. If we want, we
>> could later go with something similar to what we do for Spark but we’ll see
>> how it goes and what the Flink community needs. We should probably add a
>> section to our Flink docs that explains and links to Flink’s support policy
>> and has a table of Iceberg versions that work with Flink versions. (We
>> should probably have the same table for Spark, too!)
>>
>> For Spark, I’m also leaning toward the modified option 3 where we keep
>> all of the code in the main repository but only build with one module at a
>> time by default. It makes sense to switch based on modules — rather than
>> selecting src paths within a module — so that it is easy to run a build
>> with all modules if you choose to — for example, when building release
>> binaries.
>>
>> The reason I think we should go with option 3 is for testing. If we have
>> a single repo with api, core, etc. and spark then changes to the common
>> modules can be tested by CI actions. Updates to individual Spark modules
>> would be completely independent. There is a slight inconvenience that when
>> an API used by Spark changes, the author would still need to fix multiple
>> Spark versions. But the trade-off is that with a separate repository like
>> option 2, changes that break Spark versions are not caught and then the
>> Spark repository’s CI ends up failing on completely unrelated changes. That
>> would be a major pain, felt by everyone contributing to the Spark
>> integration, so I think option 3 is the best path forward.
>>
>> It sounds like we probably have some agreement now, but please speak up
>> if you think another option would be better.
>>
>> The next step is to prototype the build changes to test out option 3. Or
>> if you prefer option 2, then prototype those changes as well. I think that
>> Anton is planning to do this, but if you have time and the desire to do it
>> please reach out and coordinate with us!
>>
>> Ryan
>>
>> On Wed, Sep 29, 2021 at 9:12 PM Steven Wu  wrote:
>>
>>> Wing, sorry, my earlier message probably misled you. I was speaking my
>>> personal opinion on Flink version support.
>>>
>>> On Tue, Sep 28, 2021 at 8:03 PM Wing Yew Poon
>>>  wrote:
>>>
>>>> Hi OpenInx,
>>>> I'm sorry I misunderstood the thinking of the Flink community. Thanks
>>>> for the clarification.
>>>> - Wing Yew
>>>>
>>>>
>>>> On Tue, Sep 28, 2021 at 7:15 PM OpenInx  wrote:
>>>>
>>>>> Hi Wing
>>>>>
>>>>> As we discussed above, the community prefers option 2 or option 3.  So
>>>>> in fact, when we planned to upgrade the flink version from 1.12 to 1.13,
>>>>> we did our best to guarantee that the master iceberg repo works fine for
>>>>> both flink 1.12 & flink 1.13. For more context please see [1], [2], [3].
>>>>>
>>>>> [1] https://github.com/apache/iceberg/pull/3116
>>>>> [2] https://github.com/apache/iceberg/issues/3183
>>>>> [3]
>>>>> https://lists.apache.org/x/thread.html/ra438e89eeec2d4623a32822e21739c8f2229505522d73d1034e34198@%3Cdev.flink.apache.org%3E
>>>>>
>>>>>
>>>>> On Wed, Sep 29, 2021 at 5:27 AM Wing Yew Poon
>>>>>  wrote:
>>>>>
>>>>>> In the last community sync, we spent a little time on this topic. For
>>>>>> Spark support, there are currently two options under consideration:
>>>>>>
>>>>>> Option 2: Separate repo for the Spark support. Use branches for
>>>>>> supporting different Spark versions.

Re: [DISCUSS] Spark version support strategy

2021-09-28 Thread OpenInx
oudera.com.INVALID> wrote:
>>>>>>>
>>>>>>> IIUC, Option 2 is to move the Spark support for Iceberg into a
>>>>>>> separate repo (subproject of Iceberg). Would we have branches such as
>>>>>>> 0.13-2.4, 0.13-3.0, 0.13-3.1, and 0.13-3.2? For features that can be
>>>>>>> supported in all versions or all Spark 3 versions, then we would need to
>>>>>>> commit the changes to all applicable branches. Basically we are trading
>>>>>>> more work to commit to multiple branches for simplified build and CI
>>>>>>> time per branch, which might be an acceptable trade-off. However, the
>>>>>>> biggest downside is that changes may need to be made in core Iceberg as
>>>>>>> well as in the engine (in this case Spark) support, and we need to wait 
>>>>>>> for
>>>>>>> a release of core Iceberg to consume the changes in the subproject. In 
>>>>>>> this
>>>>>>> case, maybe we should have a monthly release of core Iceberg (no matter 
>>>>>>> how
>>>>>>> many changes go in, as long as it is non-zero) so that the subproject 
>>>>>>> can
>>>>>>> consume changes fairly quickly?
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 15, 2021 at 2:09 PM Ryan Blue  wrote:
>>>>>>>
>>>>>>>> Thanks for bringing this up, Anton. I’m glad that we have the set
>>>>>>>> of potential solutions well defined.
>>>>>>>>
>>>>>>>> Looks like the next step is to decide whether we want to require
>>>>>>>> people to update Spark versions to pick up newer versions of Iceberg. 
>>>>>>>> If we
>>>>>>>> choose to make people upgrade, then option 1 is clearly the best 
>>>>>>>> choice.
>>>>>>>>
>>>>>>>> I don’t think that we should make updating Spark a requirement.
>>>>>>>> Many of the things that we’re working on are orthogonal to Spark 
>>>>>>>> versions,
>>>>>>>> like table maintenance actions, secondary indexes, the 1.0 API, views, 
>>>>>>>> ORC
>>>>>>>> delete files, new storage implementations, etc. Upgrading Spark is time
>>>>>>>> consuming and untrusted in my experience, so I think we would be 
>>>>>>>> setting up
>>>>>>>> an unnecessary trade-off between spending lots of time to upgrade 
>>>>>>>> Spark and
>>>>>>>> picking up new Iceberg features.
>>>>>>>>
>>>>>>>> Another way of thinking about this is that if we went with option
>>>>>>>> 1, then we could port bug fixes into 0.12.x. But there are many things 
>>>>>>>> that
>>>>>>>> wouldn’t fit this model, like adding a FileIO implementation for ADLS. 
>>>>>>>> So
>>>>>>>> some people in the community would have to maintain branches of newer
>>>>>>>> Iceberg versions with older versions of Spark outside of the main 
>>>>>>>> Iceberg
>>>>>>>> project — that defeats the purpose of simplifying things with option 1
>>>>>>>> because we would then have more people maintaining the same 0.13.x with
>>>>>>>> Spark 3.1 branch. (This reminds me of the Spark community, where we 
>>>>>>>> wanted
>>>>>>>> to release a 2.5 line with DSv2 backported, but the community decided 
>>>>>>>> not
>>>>>>>> to so we built similar 2.4+DSv2 branches at Netflix, Tencent, Apple, 
>>>>>>>> etc.)
>>>>>>>>
>>>>>>>> If the community is going to do the work anyway — and I think some
>>>>>>>> of us would — we should make it possible to share that work. That’s 
>>>>>>>> why I
>>>>>>>> don’t think that we should go with option 1.
>>>>>>>>
>>>>>>>> If we don’t go with option 1, then the choice is how to maintain
>>>>>>>> multiple Spark versions. I think that the way we’re doing it right now 
>>>>>>>

Re: can not use iceberg as a sql source in flink sql according to iceberg 0.12.0

2021-09-22 Thread OpenInx
Hi Joshua

Can you check which parquet version you are using?  It looks like line 112
in HadoopReadOptions is not the first line accessing the variables in
ParquetInputFormat.

[image: image.png]
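
As a quick way to check, here is a minimal diagnostic sketch (an
illustration, not from the original thread): it loads the supposedly
missing class and prints which jar it was loaded from. The class name is
copied from the stack trace below; everything else is an assumption.

import java.net.URL;

public class WhichParquet {
  public static void main(String[] args) throws Exception {
    // If this throws ClassNotFoundException, the class really is absent
    // from the classpath; if it prints a location, the failure is more
    // likely a static-initialization or version conflict in shaded parquet.
    Class<?> clazz = Class.forName(
        "org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetInputFormat");
    URL location = clazz.getProtectionDomain().getCodeSource().getLocation();
    System.out.println(clazz.getName() + " loaded from " + location);
  }
}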

On Wed, Sep 22, 2021 at 11:07 PM Joshua Fan  wrote:

> Hi
> I am glad to use iceberg as a table source in flink sql; the flink version
> is 1.13.2, and the iceberg version is 0.12.0.
>
> After changing the flink version from 1.12 to 1.13 and changing some code
> in FlinkCatalogFactory, the project can be built successfully.
>
> First, I tried to write data into iceberg by flink sql, and it seems to go
> well. Then I wanted to verify the data, so I tried to read from the iceberg
> table with a simple sql like "select * from
> iceberg_catalog.catalog_database.catalog_table". The sql can be submitted,
> but the flink job kept restarting with 'java.lang.NoClassDefFoundError:
> org/apache/iceberg/shaded/org/apache/parquet/hadoop/ParquetInputFormat'.
> But, actually, ParquetInputFormat is in the iceberg-flink-runtime-0.12.0.jar.
> I have no idea why this can happen.
> The full stack trace is below:
> java.lang.NoClassDefFoundError:
> org/apache/iceberg/shaded/org/apache/parquet/hadoop/ParquetInputFormat
> at
> org.apache.iceberg.shaded.org.apache.parquet.HadoopReadOptions$Builder.(HadoopReadOptions.java:112)
> ~[iceberg-flink-runtime-0.12.0-qihoo.jar:?]
> at
> org.apache.iceberg.shaded.org.apache.parquet.HadoopReadOptions$Builder.(HadoopReadOptions.java:97)
> ~[iceberg-flink-runtime-0.12.0-qihoo.jar:?]
> at
> org.apache.iceberg.shaded.org.apache.parquet.HadoopReadOptions.builder(HadoopReadOptions.java:85)
> ~[iceberg-flink-runtime-0.12.0-qihoo.jar:?]
> at
> org.apache.iceberg.parquet.Parquet$ReadBuilder.build(Parquet.java:793)
> ~[iceberg-flink-runtime-0.12.0-qihoo.jar:?]
> at
> org.apache.iceberg.flink.source.RowDataIterator.newParquetIterable(RowDataIterator.java:135)
> ~[iceberg-flink-runtime-0.12.0-qihoo.jar:?]
> at
> org.apache.iceberg.flink.source.RowDataIterator.newIterable(RowDataIterator.java:86)
> ~[iceberg-flink-runtime-0.12.0-qihoo.jar:?]
> at
> org.apache.iceberg.flink.source.RowDataIterator.openTaskIterator(RowDataIterator.java:74)
> ~[iceberg-flink-runtime-0.12.0-qihoo.jar:?]
> at
> org.apache.iceberg.flink.source.DataIterator.updateCurrentIterator(DataIterator.java:102)
> ~[iceberg-flink-runtime-0.12.0-qihoo.jar:?]
> at
> org.apache.iceberg.flink.source.DataIterator.hasNext(DataIterator.java:84)
> ~[iceberg-flink-runtime-0.12.0-qihoo.jar:?]
> at
> org.apache.iceberg.flink.source.FlinkInputFormat.reachedEnd(FlinkInputFormat.java:104)
> ~[iceberg-flink-runtime-0.12.0-qihoo.jar:?]
> at
> org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:89)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> at
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:110)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> at
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:66)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> at
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask$LegacySourceFunctionThread.run(SourceStreamTask.java:269)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> You can see that HadoopReadOptions itself can be found, so the jar is on
> the classpath.
>
> Any help will be appricated. Thank you.
>
> Yours sincerely
>
> Josh
>


Re: [DISCUSS] Iceberg roadmap

2021-09-18 Thread OpenInx
Thanks Steven & Kyle.

Yes, the flip-27 source and flink 1.13.2 are orthogonal, because flink's
flip-27 API was introduced in the flink 1.12 release (
https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface).
The WIP flip-27 iceberg source proposed by Steven is also built on top of
flink 1.12.  They are two different things.  The flip-27 source has great
value when people want to replace their kafka with iceberg tables to
accomplish historical data backfill & bootstrap; I believe the folks from
netflix have clearly explained the value & best practices in this talk (
https://www.youtube.com/watch?v=rtz3p_iijP8&feature=youtu.be&ab_channel=NetflixData)
.  So I agree with Steven to put the flip-27 source at priority 2 in our
apache iceberg roadmap.
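
For a concrete picture of the backfill/bootstrap pattern, here is a minimal
sketch. It is written against the eventual IcebergSource API, which was
still WIP at the time, so treat the names (IcebergSource.forRowData,
SimpleSplitAssignerFactory) and the table path as assumptions.

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.source.IcebergSource;
import org.apache.iceberg.flink.source.assigner.SimpleSplitAssignerFactory;

public class Flip27Backfill {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    TableLoader loader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/db/tbl");

    // Scan the historical data first, then keep streaming new snapshots --
    // the backfill + catch-up pattern described above.
    IcebergSource<RowData> source = IcebergSource.forRowData()
        .tableLoader(loader)
        .assignerFactory(new SimpleSplitAssignerFactory())
        .streaming(true)
        .build();

    DataStream<RowData> stream = env.fromSource(
        source, WatermarkStrategy.noWatermarks(), "iceberg-flip27-source",
        TypeInformation.of(RowData.class));
    stream.print();
    env.execute("flip27-backfill");
  }
}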

On Sat, Sep 18, 2021 at 4:41 AM Kyle Bendickson
 wrote:

> This list looks overall pretty good to me. +1
>
> For Flink 1.13 upgrade, I suggest we consider starting another thread for
> it. There are some open PRs, but they have outstanding questions.
> Specifically, dropping support for Flink 1.12 or not. I think we can
> upgrade without dropping support for Flink 1.12, but we wouldn’t get some
> of the proposed benefits of 1.13 (though that can be a follow up task).
>
> I’m not presently involved in the Flink Community enough to say with
> certainty, but I believe the FLIP-27 (Using the new source interface) and
> the Flink 1.13.2 upgrade are orthogonal to each other and can both progress
> independently. But I would defer to Steven or anybody else who works with
> Flink much more often than I do currently.
>
> - Kyle Bendickson
>
> On Sep 15, 2021, at 4:06 PM, Ryan Blue  wrote:
>
> That sounds great, thanks for taking that on Jack!
>
> On Wed, Sep 15, 2021 at 3:51 PM Jack Ye  wrote:
>
>> For external Trino and PrestoDB tasks, I am thinking about creating one
>> Github project for Trino and another one for PrestoDB to manage all tasks
>> under them, adding links of issues and PRs in the other communities to
>> track progress. This is mostly to improve visibility so that people who are
>> interested can see what is going on in those 2 places.
>>
>> -Jack Ye
>>
>> On Wed, Sep 15, 2021 at 2:14 PM Ryan Blue  wrote:
>>
>>> Gidon, I think that the v3 part of encryption is actually documenting
>>> how it works and adding it to the spec. Right now we have hooks for
>>> building some encryption around it, but almost no requirements in the spec
>>> for how to use it across implementations. This is fine while we're working
>>> on defining encryption, but we eventually want to update the spec.
>>>
>>> Jack, I'm happy to add the external PrestoDB items to the roadmap. I'm
>>> just not quite sure what to do here since we aren't tracking them in the
>>> Iceberg community ourselves. I listed those as external so that we can
>>> publish links to where those are tracked in other communities. We can add
>>> as many of these as we want.
>>>
>>> Anton, I agree. The goal here is to identify the top priority items to
>>> help direct review effort. We want everything to continue progressing, but
>>> I think it's good to identify where we as a community want to focus review
>>> time.
>>>
>>> Sounds like one area of uncertainty is FLIP-27 vs Flink 1.13.2. Can
>>> someone summarize the status of Flink and what we need? I don't think I
>>> understand it well enough to suggest which one takes priority.
>>>
>>> Ryan
>>>
>>> On Mon, Sep 13, 2021 at 7:54 PM Anton Okolnychyi <
>>> aokolnyc...@apple.com.invalid> wrote:
>>>
 The discussed roadmap makes sense to me. I think it is important to
 agree on what we should do first as the review pool is limited. There are
 more and more large items that are half done or half discussed. I think we
 better focus on finishing them quickly and then move to something else as
 opposed to making very minor progress on a number of issues.

 To be clear, it is not like other things are not important or we should
 stop their development. It is more about making sure certain high-priority
 features for most folks in the community get enough attention.

 - Anton

 On 13 Sep 2021, at 12:19, Jack Ye  wrote:

 I'd like to also propose adding the following in the external section:
 1. the PrestoDB equivalent for each item listed for Trino. I am not
 sure what's the best way to track them, but I feel it's better to list and
 track them separately. I have talked with related people currently
 maintaining the PrestoDB Iceberg connector (mostly in Twitter), and they
 would like to take a different route from Trino to fully remove Hive
 dependencies in the connector. This means the 2 connectors will likely
 diverge in implementation in the near future.
 2. adding a medium item for Trino and PrestoDB Avro support
 3. adding a small item for Trino and PrestoDB full system table support
 (the system table schema in them are diverging from 

Re: [DISCUSS] Spark version support strategy

2021-09-15 Thread OpenInx
Thanks for bringing this up,  Anton.

Everyone has given great pros/cons to support their preferences.  Before
giving my preference, let me raise one question: what is the top-priority
thing for the apache iceberg project at this point in time?  This question
helps answer the follow-up question: should we support more engine versions
more robustly, or be a bit more aggressive and concentrate on getting the
new features that users need most, in order to keep the project
competitive?

If people watch the apache iceberg project and check the issues & PRs
frequently, I guess more than 90% of people will answer the priority
question the same way: there is no doubt it is making the whole v2 story
production-ready.  The current roadmap discussion also proves this:
https://lists.apache.org/x/thread.html/r84e80216c259c81f824c6971504c321cd8c785774c489d52d4fc123f@%3Cdev.iceberg.apache.org%3E
.

To focus on the highest priority at this point in time, I prefer option 1
to reduce the cost of engine maintenance, and so free up resources to make
v2 production-ready.

On Wed, Sep 15, 2021 at 3:00 PM Saisai Shao  wrote:

> From the dev's point of view, it is less of a burden to always support only
> the latest version of Spark (for example). But from the user's point of
> view, especially for those of us who maintain Spark internally, it is not
> easy to upgrade the Spark version right away (since we have many internal
> customizations), and we're still promoting the upgrade to 3.1.2. If the
> community ditches support for old versions of Spark 3, users will
> unavoidably have to maintain it themselves.
>
> So I'm inclined to keep this support in the community, not leave it to
> users themselves; as for Option 2 or 3, I'm fine with either. And to
> relieve the burden, we could support a limited number of Spark versions
> (for example, 2 versions).
>
> Just my two cents.
>
> -Saisai
>
>
> Jack Ye  于2021年9月15日周三 下午1:35写道:
>
>> Hi Wing Yew,
>>
>> I think 2.4 is a different story, we will continue to support Spark 2.4,
>> but as you can see it will continue to have very limited functionalities
>> comparing to Spark 3. I believe we discussed about option 3 when we were
>> doing Spark 3.0 to 3.1 upgrade. Recently we are seeing the same issue for
>> Flink 1.11, 1.12 and 1.13 as well. I feel we need a consistent strategy
>> around this, let's take this chance to make a good community guideline for
>> all future engine versions, especially for Spark, Flink and Hive that are
>> in the same repository.
>>
>> I can totally understand your point of view Wing, in fact, speaking from
>> the perspective of AWS EMR, we have to support over 40 versions of the
>> software because there are people who are still using Spark 1.4, believe it
>> or not. After all, continually backporting changes will become a liability
>> not only on the user side, but also on the service provider side, so I believe
>> it's not a bad practice to push for user upgrade, as it will make the life
>> of both parties easier in the end. New feature is definitely one of the
>> best incentives to promote an upgrade on user side.
>>
>> I think the biggest issue of option 3 is about its scalability, because
>> we will have an unbounded list of packages to add and compile in the
>> future, and we probably cannot drop support of that package once created.
>> If we go with option 1, I think we can still publish a few patch versions
>> for old Iceberg releases, and committers can control the amount of patch
>> versions to guard people from abusing the power of patching. I see this as
>> a consistent strategy also for Flink and Hive. With this strategy, we can
>> truly have a compatibility matrix for engine versions against Iceberg
>> versions.
>>
>> -Jack
>>
>>
>>
>> On Tue, Sep 14, 2021 at 10:00 PM Wing Yew Poon
>>  wrote:
>>
>>> I understand and sympathize with the desire to use new DSv2 features in
>>> Spark 3.2. I agree that Option 1 is the easiest for developers, but I don't
>>> think it considers the interests of users. I do not think that most users
>>> will upgrade to Spark 3.2 as soon as it is released. It is a "minor
>>> version" upgrade in name from 3.1 (or from 3.0), but I think we all know
>>> that it is not a minor upgrade. There are a lot of changes from 3.0 to 3.1
>>> and from 3.1 to 3.2. I think there are even a lot of users running Spark
>>> 2.4 and not even on Spark 3 yet. Do we also plan to stop supporting Spark
>>> 2.4?
>>>
>>> Please correct me if I'm mistaken, but the folks who have spoken out in
>>> favor of Option 1 all work for the same organization, don't they? And they
>>> don't have a problem with making their users, all internal, simply upgrade
>>> to Spark 3.2, do they? (Or they are already running an internal fork that
>>> is close to 3.2.)
>>>
>>> I work for an organization with customers running different versions of
>>> Spark. It is true that we can backport new features to older versions if we
>>> wanted to. I suppose the people contributing to Iceberg work for some
>>> organization or other 

Re: Iceberg community sync notes for 1 September 2021

2021-09-08 Thread OpenInx
One more thing: I think it will be great to have the parquet bloom filter
feature (contributed by www.iq.com, one of the largest video websites in
China) supported in iceberg 0.13.0:

1. https://github.com/apache/iceberg/pull/2643
2. https://github.com/apache/iceberg/pull/2642
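
A minimal sketch of how a user could switch the feature on once the PRs
land; the property key follows the convention proposed in those PRs
(write.parquet.bloom-filter-enabled.column.<name>) and is an assumption
here, so verify it against the merged code.

import org.apache.iceberg.Table;

public class EnableBloomFilter {
  // Assumed property key from the linked PRs; 'id' is an example column.
  static void enableFor(Table table) {
    table.updateProperties()
        .set("write.parquet.bloom-filter-enabled.column.id", "true")
        .commit();
  }
}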

On Thu, Sep 9, 2021 at 9:36 AM OpenInx  wrote:

> Thanks for the summary,  Ryan !
>
> I would like to add the following thing into the roadmap for 0.13.0:
>
> *Flink Integration*
>
> 1.  Upgrade the flink version from 1.12.1 to 1.13.2 (
> https://github.com/apache/iceberg/pull/2629).
>
> There is a bug in flink 1.12.1 when reading nested data types (Map/List)
> in flink SQL (see:
> https://github.com/apache/iceberg/pull/3081#pullrequestreview-747934199),
> and the newly released 1.13.2 has resolved it.
>
> 2.  Support for creating an iceberg table with 'connector'='iceberg' in
> flink SQL (https://github.com/apache/iceberg/pull/2666).
>
> The PR has been merged, but a flink connector document is still open for
> review (https://github.com/apache/iceberg/pull/3085).
>
> 3.  Add a streaming upsert option for the flink write sink. (
> https://github.com/apache/iceberg/pull/2863)
>
> This is an essential PR for flink upsert streams writing to an iceberg
> sink table; for more background please see
> https://github.com/apache/iceberg/pull/1996#issue-546072705.
>
> *Ecosystem/Vendor integration.*
>
> 1.  Aliyun OSS/DLF integration. (
> https://github.com/apache/iceberg/pull/2230)
>
> This is a very important job that has been suspended for a long time.  The
> good news is that Xingbo Wu <https://github.com/xingbowu> now has enough
> bandwidth to move this forward.  I think we can successfully finish this
> work if we have enough reviewing bandwidth.
>
> 2. Dell ECS integration.
>
> We have had a great discussion (https://github.com/apache/iceberg/pull/2807)
> about integrating private vendor storage/catalogs into the apache iceberg
> repo, but I'm not sure it's suitable to add to the 0.13.0 roadmap before we
> reach agreement about the unit/integration/release tests for private
> vendor integration.
>
>
> > Dan also suggested using github projects to track the progress of each
> feature.
>
> +1 ! We should make better use of github issues to manage the progress
> and blockers of our roadmap, so that everyone can sync up on the latest
> status in time and keep the roadmap moving forward.
>
>
> On Thu, Sep 9, 2021 at 7:58 AM Ryan Blue  wrote:
>
>> Hi everyone,
>>
>> The notes for the Iceberg community sync last week are now updated in the 
>> agenda/notes
>> doc
>> <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.2umwfxbo0iwo>.
>> If you have anything to add, feel free to let me know or add comments to
>> the doc.
>>
>> We mainly discussed what projects we want to add to a roadmap and how to
>> track them. I'll be sending out a discussion thread with the roadmap
>> projects that we came up with so we can finalize it and add to it. Dan also
>> suggested using github projects to track the progress of each feature.
>>
>> If you'd like to attend the syncs, you can add yourself to the iceberg-sync
>> google group <https://groups.google.com/g/iceberg-sync> to receive the
>> invites. Everyone is welcome to attend!
>>
>> Here are the notes if you prefer this over going to the doc:
>>
>> 1 September 2021
>>
>>    - Highlights
>>       - 0.12.0 release is out (Thanks, Carl!)
>>       - Metadata tables are updated for v2 (Thanks, Anton!)
>>       - Stored procedure to add and dedup files (Thanks, Szehon!)
>>    - Releases
>>       - 0.13.0 release timeline
>>          - Jack will be RM
>>          - Targeting late Oct or early Nov
>>       - 0.12.1
>>          - Reads hanging <https://github.com/apache/iceberg/issues/3055> -
>>          need to find someone. Maybe Russell?
>>          - Parquet 1.12.0 bug
>>          <https://github.com/apache/iceberg/issues/2962> - Thanks, Kyle!
>>    - Roadmap discussion
>>       - Tracking
>>          - Dan: Github projects?
>>          - Ryan: Markdown file on the site?
>>       - Roadmap scope, items

Re: Iceberg community sync notes for 1 September 2021

2021-09-08 Thread OpenInx
Thanks for the summary,  Ryan !

I would like to add the following thing into the roadmap for 0.13.0:

*Flink Integration*

1.  Upgrade the flink version from 1.12.1 to 1.13.2 (
https://github.com/apache/iceberg/pull/2629).

There is a bug in flink 1.12.1 when reading nested data types (Map/List)
in flink SQL (see:
https://github.com/apache/iceberg/pull/3081#pullrequestreview-747934199),
and the newly released 1.13.2 has resolved it.

2.  Support for creating an iceberg table with 'connector'='iceberg' in
flink SQL (https://github.com/apache/iceberg/pull/2666); see the first
sketch after this list.

The PR has been merged, but a flink connector document is still open for
review (https://github.com/apache/iceberg/pull/3085).

3.  Add a streaming upsert option for the flink write sink
(https://github.com/apache/iceberg/pull/2863); see the second sketch after
this list.

This is an essential PR for flink upsert streams writing to an iceberg
sink table; for more background please see
https://github.com/apache/iceberg/pull/1996#issue-546072705.
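
For item 2, a minimal sketch of the 'connector'='iceberg' DDL from PR #2666,
issued through Flink's TableEnvironment; the catalog properties (metastore
URI, warehouse path) are placeholders, not values from this thread.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CreateIcebergTable {
  public static void main(String[] args) {
    TableEnvironment tEnv = TableEnvironment.create(
        EnvironmentSettings.newInstance().inStreamingMode().build());
    // DDL shape from PR #2666; 'uri' and 'warehouse' are placeholders.
    tEnv.executeSql(
        "CREATE TABLE flink_table (id BIGINT, data STRING) WITH ("
            + " 'connector'='iceberg',"
            + " 'catalog-name'='hive_prod',"
            + " 'uri'='thrift://metastore:9083',"
            + " 'warehouse'='hdfs://nn:8020/warehouse'"
            + ")");
  }
}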
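
For item 3, a minimal sketch of what the streaming upsert option could look
like on the DataStream sink; the upsert(true) flag is taken from PR #2863
and may change before it is merged, and the table path is a placeholder.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.data.RowData;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;
import org.apache.iceberg.relocated.com.google.common.collect.Lists;

public class UpsertSinkSketch {
  // 'stream' carries +I/-U/+U records keyed by 'id'.
  static void appendUpsertSink(DataStream<RowData> stream) {
    TableLoader loader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/db/tbl");
    FlinkSink.forRowData(stream)
        .tableLoader(loader)
        .upsert(true)                                    // from PR #2863
        .equalityFieldColumns(Lists.newArrayList("id"))  // upsert key
        .build();
  }
}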

*Ecosystem/Vendor integration.*

1.  Aliyun OSS/DLF integration (https://github.com/apache/iceberg/pull/2230)

This is a very important job that has been suspended for a long time.  The
good news is that Xingbo Wu <https://github.com/xingbowu> now has enough
bandwidth to move this forward.  I think we can successfully finish this
work if we have enough reviewing bandwidth.

2. Dell ECS integration.

We have had a great discussion (https://github.com/apache/iceberg/pull/2807)
about integrating private vendor storage/catalogs into the apache iceberg
repo, but I'm not sure it's suitable to add to the 0.13.0 roadmap before we
reach agreement about the unit/integration/release tests for private vendor
integration.


> Dan also suggested using github projects to track the progress of each
feature.

+1 ! We should make better use of github issues to manage the progress and
blockers of our roadmap, so that everyone can sync up on the latest status
in time and keep the roadmap moving forward.


On Thu, Sep 9, 2021 at 7:58 AM Ryan Blue  wrote:

> Hi everyone,
>
> The notes for the Iceberg community sync last week are now updated in the
> agenda/notes doc
> <https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.2umwfxbo0iwo>.
> If you have anything to add, feel free to let me know or add comments to
> the doc.
>
> We mainly discussed what projects we want to add to a roadmap and how to
> track them. I'll be sending out a discussion thread with the roadmap
> projects that we came up with so we can finalize it and add to it. Dan also
> suggested using github projects to track the progress of each feature.
>
> If you'd like to attend the syncs, you can add yourself to the iceberg-sync
> google group <https://groups.google.com/g/iceberg-sync> to receive the
> invites. Everyone is welcome to attend!
>
> Here are the notes if you prefer this over going to the doc:
>
> 1 September 2021
>
>-
>
>Highlights
>-
>
>   0.12.0 release is out (Thanks, Carl!)
>   -
>
>   Metadata tables are updated for v2 (Thanks, Anton!)
>   -
>
>   Stored procedure to add and dedup files (Thanks, Szehon!)
>   -
>
>Releases
>-
>
>   0.13.0 release timeline
>   -
>
>  Jack will be RM
>  -
>
>  Targeting late Oct or early Nov
>  -
>
>   0.12.1
>   -
>
>  Reads hanging  -
>  need to find someone. Maybe Russell?
>  -
>
>  Parquet 1.12.0 bug
>  - Thanks, Kyle!
>  -
>
>Roadmap discussion
>-
>
>   Tracking
>   -
>
>  Dan: Github projects?
>  -
>
>  Ryan: Markdown file on the site?
>  -
>
>   Roadmap scope, items
>   -
>
>  Snapshot tagging and branching - Jack, Ryan (reviews)
>  -
>
>  Encryption - Gidon, Jack, Yufei
>  -
>
>  Merge-on-read plans in Spark - Anton, Ryan (reviews)
>  -
>
> New writers
> -
>
>  Delete compaction - Junjie, Puneet
>  -
>
>  Python - probably publish a separate roadmap
>  -
>
> Separate google group
> 
> -
>
>  Views - Anjali, John
>  -
>
>  Secondary indexes - Miao, Guy, Jack (some reviews)
>  -
>
> File-level
> -
>
> Rollup
> -
>
>  Spark streaming - Sreeram, Kyle, Anton (reviews)
>  -
>
> CDC use case
> -
>
> Limit support to process large snapshots
> -
>
> CDC with Iceberg source
> -
>
>  [v3] Relative paths - Anurag, Yufei
>  -
>
>  [v3] Z-ordering - Russell
>  -
>
>  [v3] Default values in schemas - Owen
>  -
>
>  Format v2 support 

Re: [VOTE] Adopt the v2 spec changes

2021-07-28 Thread OpenInx
> adopt the pending v2 spec changes as the supported v2 spec

I assume this vote is about reaching consensus among the community members
that we won't introduce any breaking changes in the v2 spec, not about
exposing v2 to SQL tables like the following, right?

CREATE TABLE prod.db.sample (
    id bigint COMMENT 'unique id',
    data string)
USING iceberg
TBLPROPERTIES (
    'format.version' = '2'
);

If so, then I will give a binding +1 from my side, because I don't have
other particular changes that need to be introduced in v2 now.

For exposing v2 to end users, I think we could also try to merge the PR and
mark v2 as an experimental feature, because I have found many people trying
to test and benchmark the v2 feature based on the practice from
https://github.com/apache/iceberg/pull/2410. Using
meta.upgradeToFormatVersion(2) is quite unfriendly for users testing this
v2 feature now.

Thanks.



On Wed, Jul 28, 2021 at 1:09 PM Jack Ye  wrote:

> (non-binding) +1, this looks like a clean cut to me. We have been testing
> v2 internally for quite a while now, hopefully it can become the new
> default version soon to enable row level delete and update.
>
> -Jack Ye
>
>
> On Tue, Jul 27, 2021 at 9:59 AM Ryan Blue  wrote:
>
>> I’d like to propose that we adopt the pending v2 spec changes as the
>> supported v2 spec. The full list of changes is documented in the v2
>> summary section of the spec .
>>
>> The major breaking change is the addition of delete files and metadata to
>> track delete files. In addition, there are a few other minor breaking
>> changes. For example, v2 drops the block_size_in_bytes field in
>> manifests that was previously required and also omits fields in table
>> metadata that are now tracked by lists; schema is no longer written in
>> favor of schemas. Other changes are forward compatible, mostly
>> tightening field requirements where possible (e.g., schemas and
>> current-schema-id are now required).
>>
>> Adopting the changes will signal that the community intends to support
>> the current set of changes and will guarantee forward-compatibility for v2
>> tables that implement the current v2 spec. Any new breaking changes would
>> go into v3.
>>
>> Please vote on adopting the v2 changes in the next 72 hours.
>>
>> [ ] +1 Adopt the changes as v2
>> [ ] +0
>> [ ] -1 Do not adopt the changes, because…
>> --
>> Ryan Blue
>>
>


Re: Welcoming Jack Ye as a new committer!

2021-07-06 Thread OpenInx
Congrats, Jack !

On Wed, Jul 7, 2021 at 7:40 AM Miao Wang  wrote:

> Congratulations!
>
> Miao
>
> Sent from my iPhone
>
> On Jul 5, 2021, at 4:14 PM, Daniel Weeks  wrote:
>
> 
> Great work Jack, Congratulations!
>
> On Mon, Jul 5, 2021 at 1:21 PM karuppayya 
> wrote:
>
>> Congratulations Jack!
>>
>> On Mon, Jul 5, 2021 at 1:14 PM Yufei Gu  wrote:
>>
>>> Congratulations, Jack! Thanks for the contribution!
>>>
>>> Best,
>>>
>>> Yufei
>>>
>>>
>>> On Mon, Jul 5, 2021 at 1:09 PM John Zhuge  wrote:
>>>
 Congratulations Jack!

 On Mon, Jul 5, 2021 at 12:57 PM Marton Bod 
 wrote:

> Congrats Jack!
>
> On Mon, Jul 5, 2021 at 9:54 PM Wing Yew Poon
>  wrote:
>
>> Congratulations Jack!
>>
>>
>> On Mon, Jul 5, 2021 at 11:35 AM Ryan Blue  wrote:
>>
>>> Hi everyone,
>>>
>>> I'd like to welcome Jack Ye as a new Iceberg committer.
>>>
>>> Thanks for all your contributions, Jack!
>>>
>>> Ryan
>>>
>>> --
>>> Ryan Blue
>>>
>> --
 John Zhuge

>>>


Re: Welcoming OpenInx as a new PMC member!

2021-06-29 Thread OpenInx
Thanks all !

I really appreciate the trust from the Apache Iceberg community.  For me,
this is not only an honor, but also a responsibility.  I'd like to share
something about the current Apache Iceberg status in Asia:

In the past year, Apache Iceberg has been growing rapidly among Asian
users.  Many internet companies have forked their own branches to customize
their Iceberg services (a few examples, not all):

1.   Aliyun.com (from Alibaba): we have successfully integrated Apache
Iceberg into our Aliyun EMR services, and we are serving some customers
with a huge data scale (PB).
2.   Tencent: Iceberg has become a very important piece of infrastructure
for their internal users; besides that, Tencent also provides public cloud
services (https://intl.cloud.tencent.com) for external customers.  There is
a post sharing their Flink + Iceberg experience:
https://www.alibabacloud.com/blog/flink-%2B-iceberg-how-to-construct-a-whole-scenario-real-time-data-warehouse_597824
3.   Dell Inc.: they have integrated the Iceberg data lake into their
on-prem storage deployment (which implements the AWS S3 API, so people can
easily migrate their AWS S3 data to the on-prem Dell storage deployment if
they want to) for their customers.
https://www.infoq.cn/article/Pe9ejRJDrJsp5AIhjlE3
4.   Oppo (https://www.oppo.com/en/), one of the biggest mobile phone
manufacturers in the world, has adopted Apache Iceberg as their internal
data lake table format: https://www.infoq.cn/article/kuyk9ieusyyxbq5loflu
5.   Netease (https://en.wikipedia.org/wiki/NetEase) has also adopted
Apache Iceberg to serve their business:
https://developpaper.com/netease-exploration-and-practice-of-flink-iceberg-data-lake/

We also had several offline Iceberg meetups in Asia in the past year:

1.  DataFunSummit Data Lake meetup:
https://mp.weixin.qq.com/s/bJldRWy3rg8su2jiV_-5xQ
https://mp.weixin.qq.com/s/Ax2Dr7w7RxWxMGyyGlAopg#at
2.  Flink x Iceberg Meetup in Shanghai:
https://zhuanlan.zhihu.com/p/361539420
3.  Apache Iceberg meetup in Shenzhen:
https://segmentfault.com/a/119024535102

At present, the Iceberg community is developing well in Asia. I am very
happy to see that, with our joint efforts, the Apache Iceberg project and
community can become even better!


On Wed, Jun 30, 2021 at 10:39 AM John Zhuge  wrote:

> Congratulations!
>
> On Tue, Jun 29, 2021 at 7:32 PM wgcn.bj  wrote:
>
>> Congrats!
>>
>>  原始邮件
>> *发件人:* Dongjoon Hyun
>> *收件人:* dev
>> *发送时间:* 2021年6月30日(周三) 10:05
>> *主题:* Re: Welcoming OpenInx as a new PMC member!
>>
>> Congratulations!
>>
>> Dongjoon.
>>
>> On Tue, Jun 29, 2021 at 6:35 PM Forward Xu 
>> wrote:
>>
>>> Congratulations!
>>>
>>>
>>> best
>>>
>>> Forward
>>>
>>> Miao Wang  于2021年6月30日周三 上午8:25写道:
>>>
>>>> Congratulations!
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Jun 29, 2021, at 4:57 PM, Steven Wu  wrote:
>>>>
>>>> 
>>>> Congrats!
>>>>
>>>> On Tue, Jun 29, 2021 at 2:12 PM Huadong Liu 
>>>> wrote:
>>>>
>>>>>
>>>>> Congrats Zheng!
>>>>>
>>>>>
>>>>> On Tue, Jun 29, 2021 at 1:52 PM Ryan Blue  wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I'd like to welcome OpenInx (Zheng Hu) as a new Iceberg PMC member.
>>>>>>
>>>>>> Thanks for all your contributions and commitment to the
>>>>>> project, OpenInx!
>>>>>>
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>>
>>>>> --
> John Zhuge
>


Re: Stableness of V2 Spec/API

2021-05-17 Thread OpenInx
PR 2303 defines how the batch job does the compaction work, and issue 2308
decides the behavior when a compaction txn and a row-delta txn commit at
the same time.  They shouldn't block each other, but we will need to
resolve both of them.

On Tue, May 18, 2021 at 9:36 AM Huadong Liu  wrote:

> Thanks. Compaction is https://github.com/apache/iceberg/pull/2303 and it
> is currently blocked by https://github.com/apache/iceberg/issues/2308?
>
> On Mon, May 17, 2021 at 6:17 PM OpenInx  wrote:
>
>> Hi Huadong
>>
>> From the perspective of iceberg developers, we don't expose format v2 to
>> end users because we think there is still other work that needs to be
>> done. As you can see, there are still some unfinished issues in your link.
>> As for whether v2 will cause data loss: from my perspective as a designer,
>> semantics and correctness are handled very rigorously as long as we don't
>> do any compaction.  Once we introduce the compaction action, we will
>> encounter this issue: https://github.com/apache/iceberg/issues/2308;
>> we've proposed a solution but have not yet reached agreement in the
>> community.  I would suggest using v2 in production only after we resolve
>> at least this issue.
>>
>> On Sat, May 15, 2021 at 8:01 AM Huadong Liu  wrote:
>>
>>> Hi iceberg-dev,
>>>
>>> I tried v2 row-level deletion by committing equality delete files after
>>> *upgradeToFormatVersion(2)*. It worked well. I know that Spark actions
>>> to compact delete files and data files
>>> <https://github.com/apache/iceberg/milestone/4> etc. are in progress. I
>>> currently use the JAVA API to update, query and do maintenance ops. I am
>>> not using Flink at the moment and I will definitely pick up Spark actions
>>> when they are completed. Deletions can be scheduled in batches (e.g.
>>> weekly) to control the volume of delete files. I want to get a sense of the
>>> risk level of losing data at some point because of v2 Spec/API changes if I
>>> start to use v2 format now. It is not an easy question. Any input is
>>> appreciated.
>>>
>>> --
>>> Huadong
>>>
>>


Re: Stableness of V2 Spec/API

2021-05-17 Thread OpenInx
Hi Huadong

From the perspective of iceberg developers, we don't expose format v2 to
end users because we think there is still other work that needs to be done.
As you can see, there are still some unfinished issues in your link.  As
for whether v2 will cause data loss: from my perspective as a designer,
semantics and correctness are handled very rigorously as long as we don't
do any compaction.  Once we introduce the compaction action, we will
encounter this issue: https://github.com/apache/iceberg/issues/2308; we've
proposed a solution but have not yet reached agreement in the community.  I
would suggest using v2 in production only after we resolve at least this issue.

On Sat, May 15, 2021 at 8:01 AM Huadong Liu  wrote:

> Hi iceberg-dev,
>
> I tried v2 row-level deletion by committing equality delete files after
> *upgradeToFormatVersion(2)*. It worked well. I know that Spark actions to
> compact delete files and data files
>  etc. are in progress. I
> currently use the JAVA API to update, query and do maintenance ops. I am
> not using Flink at the moment and I will definitely pick up Spark actions
> when they are completed. Deletions can be scheduled in batches (e.g.
> weekly) to control the volume of delete files. I want to get a sense of the
> risk level of losing data at some point because of v2 Spec/API changes if I
> start to use v2 format now. It is not an easy question. Any input is
> appreciated.
>
> --
> Huadong
>


Re: how to test row level delete

2021-04-06 Thread OpenInx
Hi Chen Song

If you want to test format v2 in your env, you can follow this comment
https://github.com/apache/iceberg/pull/2410#issuecomment-812463051
to upgrade your iceberg table to format v2.

TableProperties.FORMAT_VERSION was introduced in a separate PoC PR, so you
will not find this static variable on the current apache iceberg master
branch.
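
For illustration, here is a minimal sketch of what the linked comment boils
down to. BaseTable and TableOperations are Iceberg-internal classes, so
treat this as a testing-only sketch rather than a public API.

import org.apache.iceberg.BaseTable;
import org.apache.iceberg.Table;
import org.apache.iceberg.TableMetadata;
import org.apache.iceberg.TableOperations;

public class UpgradeToV2 {
  static void upgrade(Table table) {
    TableOperations ops = ((BaseTable) table).operations();
    TableMetadata base = ops.current();
    // Commits new table metadata with format-version = 2.
    ops.commit(base, base.upgradeToFormatVersion(2));
  }
}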

On Wed, Apr 7, 2021 at 3:28 AM Chen Song  wrote:

> Hey I want to quickly follow up on this thread.
>
> I cannot seem to find any pull request to expose V2 format version on
> table creation, specifically for the line below referenced in your
> previous email.
>
> TableProperties.FORMAT_VERSION
>
> Can you suggest? I want to create a V2 table to test some row level
> upserts/deletes.
>
> Chen
>
> On Sun, Dec 27, 2020 at 9:33 PM OpenInx  wrote:
>
>> > you can apply this patch in your own repository
>>
>> The patch is : https://github.com/apache/iceberg/pull/1978
>>
>> On Mon, Dec 28, 2020 at 10:32 AM OpenInx  wrote:
>>
>>> Hi liubo07199
>>>
>>> Thanks for testing the iceberg row-level delete.  I skimmed the code, and
>>> it seems you were trying the equality-delete feature.  For iceberg users,
>>> I think we shouldn't have to write this kind of iceberg-internal code to
>>> get it working; it isn't friendly for users.  Instead, we usually use the
>>> equality-delete feature (CDC event ingestion or flink aggregation upsert
>>> streams) through the compute-engine integrations. Currently, we support
>>> the flink cdc-events integration (the Flink DataStream integration has
>>> been merged [1], while the Flink SQL integration depends on when we are
>>> ready to expose iceberg format v2 [2])
>>>
>>> For when format v2 will be exposed to users, you may want to read this
>>> mail [3].
>>>
>>> If you just want a basic test of writing cdc by flink, you can apply this
>>> patch in your own repository, and then create an iceberg table with an
>>> extra option like the following:
>>>
>>> public static Table createTable(String path, Map<String, String> properties, boolean partitioned) {
>>>   PartitionSpec spec;
>>>   if (partitioned) {
>>>     spec = PartitionSpec.builderFor(SCHEMA).identity("data").build();
>>>   } else {
>>>     spec = PartitionSpec.unpartitioned();
>>>   }
>>>   properties.put(TableProperties.FORMAT_VERSION, "2");
>>>   return new HadoopTables().create(SCHEMA, spec, properties, path);
>>> }
>>>
>>> Then use the flink DataStream API or flink SQL to write the cdc events
>>> into an apache iceberg table.  For a DataStream job sinking cdc events, I
>>> suggest using an approach similar to [4].
>>>
>>> I'd like to help if you have further feedback.
>>>
>>> Thanks.
>>>
>>> [1]. https://github.com/apache/iceberg/pull/1974
>>> [2]. https://github.com/apache/iceberg/pull/1978
>>> [3].
>>> https://mail-archives.apache.org/mod_mbox/iceberg-dev/202012.mbox/%3CCACc8XkGt%2B5kxr-XRMgY1eUKjd70mej38KFbhDuV2MH3AVMON2g%40mail.gmail.com%3E
>>> [4].
>>> https://github.com/apache/iceberg/pull/1974/files#diff-13e2e5b52d0effe51e1b470df77cb08b5ec8cc4f3a7f0fd4e51ee212fc83f76aR143
>>>
>>> On Sat, Dec 26, 2020 at 7:14 PM 1  wrote:
>>>
>>>> Hi, all:
>>>>
>>>> I want to try row-level delete, but I get the exception:
>>>> IllegalArgumentException: Cannot write delete files in a v1 table.
>>>> I looked over https://iceberg.apache.org/spec/#table-metadata for
>>>> format-version; it says: "An integer version number for the format.
>>>> Currently, this is always 1. Implementations must throw an exception if a
>>>> table's version is higher than the supported version." So what can I do
>>>> to test row-level deletion? How do I create a v2 table?
>>>>
>>>> thx
>>>>
>>>> Code is :
>>>>
>>>> private static void deleteRead() throws IOException {
>>>>   Schema deleteRowSchema = table.schema().select("id");
>>>>   Record dataDelete = GenericRecord.create(deleteRowSchema);
>>>>   List<Record> dataDeletes = Lists.newArrayList(
>>>>       dataDelete.copy("id", 11), // id = 29
>>>>       dataDelete.copy("id", 12), // id = 89
>>>>       da

Re: When is the next release of Iceberg ?

2021-04-02 Thread OpenInx
Hi Himanshu

If you want to try flink + iceberg for syncing mysql binlog to an iceberg
table, you might be interested in these PRs (see the sketch after the list):

1. https://github.com/apache/iceberg/pull/2410
2. https://github.com/apache/iceberg/pull/2303
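
To make the intended usage concrete, here is a minimal sketch of sinking a
CDC-style stream with the DataStream API; the inline +I/-D records stand in
for a real Debezium source, and the table path is a placeholder.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.StringData;
import org.apache.flink.types.RowKind;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;
import org.apache.iceberg.relocated.com.google.common.collect.Lists;

public class CdcToIceberg {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Stand-in for a Debezium/binlog stream: an insert then a delete on id=1.
    DataStream<RowData> cdc = env.fromElements(RowData.class,
        GenericRowData.ofKind(RowKind.INSERT, 1L, StringData.fromString("a")),
        GenericRowData.ofKind(RowKind.DELETE, 1L, StringData.fromString("a")));

    TableLoader loader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/db/tbl");
    FlinkSink.forRowData(cdc)
        .tableLoader(loader)
        .equalityFieldColumns(Lists.newArrayList("id"))  // identifies rows for deletes
        .build();

    env.execute("cdc-to-iceberg");
  }
}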

On Wed, Mar 24, 2021 at 10:34 AM OpenInx  wrote:

> Hi Himanshu
>
> Thanks for the email,  currently we flink+iceberg support writing CDC
> events into apache iceberg table by flink datastream API, besides the
> spark/presto/hive could read those events in batch job.
>
> But there are still some issues that we do not finish yet:
>
> 1.  Expose the iceberg v2 to end users.  The row-level delete feature is
> actually built on the iceberg format v2,  there are still some blockers
> that we need to fix (pls see the document
> https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/edit),
> we iceberg team will need some resources to resolve them.
> 2.  As we know the CDC events depend on iceberg primary key
> identification  (Then we could define mysql_cdc sql table by using primary
> key cause) I saw Jack Ye has published a PR to this
> https://github.com/apache/iceberg/pull/2354,  I will review it today.
> 3.  The CDC writers will inevitably produce many small files as the
> periodic checkpoints go on, so for a real production env we must provide
> the ability to rewrite small files into larger files (a compaction
> action).  There are a few PRs that need review:
>a.  https://github.com/apache/iceberg/pull/2303/files
>b.  https://github.com/apache/iceberg/pull/2294
>c.  https://github.com/apache/iceberg/pull/2216
>
> I think it's better to resolve all those issues before we put
> production data into iceberg (syncing the mysql binlog via debezium).  I saw
> the last sync notes saying that the next release, 0.12.0, would ideally be
> cut at the end of this month (
> https://lists.apache.org/x/thread.html/rdb7d1ab221295adec33cf93dcbcac2b9b7b80708b2efd903b7105511@%3Cdev.iceberg.apache.org%3E)
> , but I think that deadline is too tight.  In my mind, if release
> 0.12.0 won't expose format v2 to end users, then what are the core
> features that we want to release?  If the features that we plan to release
> are not major ones, then how about releasing 0.11.2 instead?
>
> According to my understanding of the needs of community users, the vast
> majority of iceberg users have high expectations for format v2. I think we
> may need to raise the v2 exposure to a higher priority so that our users
> can do the whole PoC tests earlier.
>
>
>
> On Wed, Mar 24, 2021 at 3:49 AM Himanshu Rathore
>  wrote:
>
>> We are planning to use Flink + Iceberg for syncing mysql binlogs via
>> debezium, and it seems some things are dependent on the next release.
>>
>


Re: Welcoming Ryan Murray as a new committer!

2021-03-29 Thread OpenInx
Congrats, Ryan !  Well-deserved !

On Tue, Mar 30, 2021 at 9:32 AM Junjie Chen 
wrote:

> Congratulations. Ryan!
>
> On Tue, Mar 30, 2021 at 5:02 AM Daniel Weeks 
> wrote:
>
>> Congrats, Ryan and thanks for all the great work!
>>
>> On Mon, Mar 29, 2021 at 1:59 PM Ryan Blue 
>> wrote:
>>
>>> Congratulations, Ryan!
>>>
>>> On Mon, Mar 29, 2021 at 1:49 PM Thirumalesh Reddy <
>>> thirumal...@dremio.com> wrote:
>>>
 Congratulations Ryan

 Thirumalesh Reddy
 Dremio | VP of Engineering


 On Mon, Mar 29, 2021 at 9:16 AM Xinli shang 
 wrote:

> Congratulations Ryan!
>
> On Mon, Mar 29, 2021 at 9:13 AM Russell Spitzer <
> russell.spit...@gmail.com> wrote:
>
>> :D We will always be Iceberg-comitter twins now
>>
>> > On Mar 29, 2021, at 11:10 AM, Szehon Ho 
>> wrote:
>> >
>> > That’s awesome, great work Ryan.
>> >
>> > Szehon
>> >
>> >> On 29 Mar 2021, at 18:08, Anton Okolnychyi
>>  wrote:
>> >>
>> >> Hey folks,
>> >>
>> >> I’d like to welcome Ryan Murray as a new committer to the project!
>> >>
>> >> Thanks for all the hard work, Ryan!
>> >>
>> >> - Anton
>> >
>>
>>
>
> --
> Xinli Shang
>

>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Best Regards
>


Re: Welcoming Russell Spitzer as a new committer

2021-03-29 Thread OpenInx
Congrats, Russell !  Well-deserved !

On Tue, Mar 30, 2021 at 9:33 AM Junjie Chen 
wrote:

>   Congratulations, Russell! Nice work!
>
> On Tue, Mar 30, 2021 at 5:02 AM Daniel Weeks 
> wrote:
>
>> Congrats, Russell!
>>
>> On Mon, Mar 29, 2021 at 1:59 PM Ryan Blue 
>> wrote:
>>
>>> Congratulations, Russell!
>>>
>>> -- Forwarded message -
>>> From: Gautam Kowshik 
>>> Date: Mon, Mar 29, 2021 at 12:16 PM
>>> Subject: Re: Welcoming Russell Spitzer as a new committer
>>> To: 
>>>
>>>
>>> Congrats Russell!
>>>
>>> Sent from my iPhone
>>>
>>> On Mar 29, 2021, at 9:41 AM, Dilip Biswal  wrote:
>>>
>>> 
>>> Congratulations Russel !! Very well deserved, indeed !!
>>>
>>> On Mon, Mar 29, 2021 at 9:13 AM Miao Wang 
>>> wrote:
>>>
 Congratulations Russell!



 Miao



 *From: *Szehon Ho 
 *Reply-To: *"dev@iceberg.apache.org" 
 *Date: *Monday, March 29, 2021 at 9:12 AM
 *To: *"dev@iceberg.apache.org" 
 *Subject: *Re: Welcoming Russell Spitzer as a new committer



 Awesome, well-deserved, Russell!



 Szehon



 On 29 Mar 2021, at 18:10, Holden Karau  wrote:



 Congratulations Russel!



 On Mon, Mar 29, 2021 at 9:10 AM Anton Okolnychyi <
 aokolnyc...@apple.com.invalid> wrote:

 Hey folks,

 I’d like to welcome Russell Spitzer as a new committer to the project!

 Thanks for all your contributions, Russell!

 - Anton

 --

 Twitter: https://twitter.com/holdenkarau
 

 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9
 

 YouTube Live Streams: https://www.youtube.com/user/holdenkarau
 



>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Best Regards
>


Re: Welcoming Yan Yan as a new committer!

2021-03-23 Thread OpenInx
Congrats Yan !   You deserve it.

On Wed, Mar 24, 2021 at 7:18 AM Miao Wang  wrote:

> Congrats @Yan Yan !
>
>
>
> Miao
>
>
>
> *From: *Ryan Blue 
> *Reply-To: *"dev@iceberg.apache.org" 
> *Date: *Tuesday, March 23, 2021 at 3:43 PM
> *To: *Iceberg Dev List 
> *Subject: *Welcoming Yan Yan as a new committer!
>
>
>
> Hi everyone,
>
> I'd like to welcome Yan Yan as a new Iceberg committer.
>
> Thanks for all your contributions, Yan!
>
> rb
>
>
>
> --
>
> Ryan Blue
>


Re: When is the next release of Iceberg ?

2021-03-23 Thread OpenInx
Hi Himanshu

Thanks for the email.  Currently flink + iceberg supports writing CDC
events into an Apache Iceberg table via the Flink DataStream API, and
spark/presto/hive can read those events in batch jobs.

But there are still some issues that we have not finished yet:

1.  Expose iceberg v2 to end users.  The row-level delete feature is
actually built on iceberg format v2, and there are still some blockers
that we need to fix (please see the document
https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/edit);
the iceberg team will need some resources to resolve them.
2.  As we know, the CDC events depend on iceberg primary key identification
(then we could define a mysql_cdc SQL table using a primary key clause). I saw
Jack Ye has published a PR for this:
https://github.com/apache/iceberg/pull/2354,  I will review it today.
3.  The CDC writers will inevitably produce many small files as the
periodic checkpoints go on, so for a real production env we must provide
the ability to rewrite small files into larger files (a compaction
action).  There are a few PRs that need review:
   a.  https://github.com/apache/iceberg/pull/2303/files
   b.  https://github.com/apache/iceberg/pull/2294
   c.  https://github.com/apache/iceberg/pull/2216

I think it's better to resolve all those issues before we put
production data into iceberg (syncing the mysql binlog via debezium).  I saw
the last sync notes saying that the next release, 0.12.0, would ideally be
cut at the end of this month (
https://lists.apache.org/x/thread.html/rdb7d1ab221295adec33cf93dcbcac2b9b7b80708b2efd903b7105511@%3Cdev.iceberg.apache.org%3E)
, but I think that deadline is too tight.  In my mind, if release
0.12.0 won't expose format v2 to end users, then what are the core
features that we want to release?  If the features that we plan to release
are not major ones, then how about releasing 0.11.2 instead?

According to my understanding of the needs of community users, the vast
majority of iceberg users have high expectations for format v2. I think we
may need to raise the v2 exposure to a higher priority so that our users
can do the whole PoC tests earlier.



On Wed, Mar 24, 2021 at 3:49 AM Himanshu Rathore
 wrote:

> We are planning to use Flink + Iceberg for syncing mysql binlogs via
> debezium, and it seems some things are dependent on the next release.
>


Sync: the progress of row-level delete

2021-03-14 Thread OpenInx
Hi iceberg dev:

Currently, Junjie Chen and I have made some progress on the Rewrite
Action for format v2.  We will have two kinds of Rewrite Action:

1.   The first one is rewriting equality delete rows into position delete
rows.  The PoC PR is here: https://github.com/apache/iceberg/pull/2216
2.  The second one removes all deletes when rewriting.  The PR is:
https://github.com/apache/iceberg/pull/2303

The motivation for Junjie and me to raise the priority of RewriteAction a bit
is: some companies in Asia are doing PoCs that write
CDC/Upsert events into iceberg tables and then read them with batch
flink/spark/presto jobs.   The biggest bottleneck is small delete/data
files: as the streaming job checkpoints periodically, it produces so
many small data/equality/pos-delete files in the underlying filesystem that
read performance suffers.

About the implementation of RewriteAction, I think we are confident we can
accomplish this.  The key problem is: how to handle the conflicts between
a RewriteFiles txn and a RowDelta txn?   I filed an issue here:
https://github.com/apache/iceberg/issues/2308

In my opinion, the RewriteFiles action never changes the data set of
the iceberg table; it will not add, remove, or change a single row.  So
from a database developer's perspective, it should not conflict with the
normal RowDelta transactions because there's no key/row overlap between the
two actions.  But in the iceberg implementation, we have to handle the
conflicts because both the RewriteAction and the RowDelta txn share the same
increasing sequence number.

Let's discuss the case from ISSUE#2308:

The original table will have the following data set at sequence number 1:

Seq1:  (RowDelta 1)
INSERT,  <1, A>
INSERT,  <2, B>
DELETE, <1, A>

If the RewriteAction commits before the following RowDelta, we will have the
following operations and sequence numbers (finally, reading the latest
snapshot yields the empty set):

Seq2:  (Rewrite)
INSERT, <2, B>

Seq3:  (RowDelta 2)
DELETE, <2, B>

While if the RowDelta commits before the RewriteAction, we will have the
following operations and sequence numbers (finally, reading the latest
snapshot yields <2, B>):

Seq2: (RowDelta 2)
DELETE, <2, B>

Seq3: (Rewrite)
INSERT, <2,B>


Summary: as we can see, different commit orders produce different data
sets in the iceberg table, which is not the expected semantic from a user's
perspective.  So I'm considering having the RewriteFiles action commit its
txn without producing a new auto-incrementing sequence id (instead using the
largest sequence number among the existing files it rewrites); then the
result is always consistent regardless of the commit order.  Since this
change touches the iceberg table format/spec, I'd like to hear your voice.
What do you think?
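
To make the conflict concrete, here is a minimal sketch (my illustration, not
code from this thread; `table`, `smallFiles`, `compactedFile`, and `eqDeletes`
are assumed handles produced elsewhere) of the two transactions whose commit
order decides the final result:

// Rewrite txn: replaces the small files with one compacted file.  It adds,
// removes, and changes no rows, but today it still receives a new sequence
// number.  RewriteFiles#rewriteFiles takes (Set<DataFile> toDelete,
// Set<DataFile> toAdd); Sets is the relocated guava helper.
table.newRewrite()
    .rewriteFiles(smallFiles, Sets.newHashSet(compactedFile))
    .commit();

// RowDelta txn: adds an equality-delete file, e.g. DELETE <2, B>.  Whether
// it lands before or after the rewrite decides whether <2, B> survives.
table.newRowDelta()
    .addDeletes(eqDeletes)
    .commit();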

Thanks.


Re: Secondary Indexes - Pluggable File Filter interface for Apache Iceberg

2021-03-03 Thread OpenInx
That will be 1:00 AM (China Standard Time) on 18 March, which works for
us in Asia.   I'd love to attend this discussion.  Thanks.

On Thu, Mar 4, 2021 at 9:50 AM Ryan Blue  wrote:

> Thanks for putting this together, Guy! I just did a pass over the doc and
> it looks like a really reasonable proposal for being able to inject custom
> file filter implementations.
>
> One of the main things we need to think about is how to store and track
> the index data. There's a comment in the doc about storing them in a
> "consolidated fashion" and I'd like to hear more about what you're thinking
> there. The index-per-file approach that Adobe is working on is a good way
> to track index data because we get a clear lifecycle for it: the index is
> tied to a data file that is immutable. On the other hand, the
> drawback is that we have a lot of index files -- one per data file.
>
> Let's set up a time to go talk through the options. Would 9AM PST (17:00
> UTC) on 17 March work for everyone? I'm thinking in the morning so everyone
> from IBM can attend. We can do a second discussion at a time that works
> more for people in Asia later on as well.
>
> If that day works, then I'll send out an invite.
>
> On Fri, Feb 19, 2021 at 8:49 AM Guy Khazma  wrote:
>
>> Hi All,
>>
>> Following up on our discussion from Wednesday sync here attached is a
>> proposal to enhance iceberg with a pluggable interface for data skipping
>> indexes to enable use of existing indexes in job planning.
>>
>>
>> https://docs.google.com/document/d/11o3T7XQVITY_5F9Vbri9lF9oJjDZKjHIso7K8tEaFfY/edit?usp=sharing
>>
>> We will be glad to get you feedback.
>>
>> Thanks,
>> Guy
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Sync to discuss secondary index proposal

2021-01-28 Thread OpenInx
Sorry  I sent the wrong link,  the secondary index document link is:
https://docs.google.com/document/d/1E1ofBQoKRnX04bWT3utgyHQGaHZoelgXosk_UNsTUuQ/edit

On Fri, Jan 29, 2021 at 10:31 AM OpenInx  wrote:

> Hi
>
> @Miao Wang   Would you mind sharing your current PoC
> code or PR for this document [1], if possible?   I'd like to understand
> more details before I get involved in this discussion.
>
> Thanks.
>
> [1].
> https://docs.google.com/document/d/1q6xaBxUPFwYsW9aXWxYUh7die6O7rDeAPFQcTAMQ0GM/edit?ts=601316b0#
>
> On Fri, Jan 29, 2021 at 10:16 AM 李响  wrote:
>
>> +1, my colleagues and I are at UTC+8
>>
>> On Fri, Jan 29, 2021 at 9:50 AM OpenInx  wrote:
>>
>>> +1,  my time zone is CST.
>>>
>>> On Fri, Jan 29, 2021 at 6:57 AM Xinli shang 
>>> wrote:
>>>
>>>> I had some earlier discussion with Miao on this. I am still
>>>> interested in it. My time zone is PST.
>>>>
>>>> On Thu, Jan 28, 2021 at 2:50 PM Jack Ye  wrote:
>>>>
>>>>> +1, looking forward to the discussion, please include me and Yan (
>>>>> yyany...@gmail.com), also in PST.
>>>>> -Jack
>>>>>
>>>>> On Thu, Jan 28, 2021 at 2:16 PM Russell Spitzer <
>>>>> russell.spit...@gmail.com> wrote:
>>>>>
>>>>>> CST Please :) But I don’t mind waking up early or staying up late as
>>>>>> required
>>>>>>
>>>>>> On Jan 28, 2021, at 4:14 PM, Ryan Blue 
>>>>>> wrote:
>>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> The proposal that Miao wrote about secondary indexes has come up a
>>>>>> lot lately. I think it would be a good time to have a discussion about 
>>>>>> the
>>>>>> proposal and set some initial goals for what we want to do next. Since
>>>>>> there hasn't been much discussion on the dev list, I'll schedule a sync 
>>>>>> so
>>>>>> that everyone has a deadline to read the proposal and be ready with
>>>>>> questions. Then we can have a quick summary to start with and a 
>>>>>> productive
>>>>>> discussion.
>>>>>>
>>>>>> First, who is interested in attending? I think that Miao and I are in
>>>>>> the US in PST (UTC-8). I think Paula from IBM in Israel is interested.
>>>>>> Anyone else in a time zone that we should try to include? We can always
>>>>>> have two discussions if we need to include more zones.
>>>>>>
>>>>>> Please reply if you're interested so we can get something set up.
>>>>>> Thanks!
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Xinli Shang
>>>>
>>>
>>
>> --
>>
>>李响 Xiang Li
>>
>> 手机 cellphone :+86-136-8113-8972
>> 邮件 e-mail  :wate...@gmail.com
>>
>


Re: Sync to discuss secondary index proposal

2021-01-28 Thread OpenInx
Hi

@Miao Wang   Would you mind sharing your current PoC
code or PR for this document [1], if possible?   I'd like to understand
more details before I get involved in this discussion.

Thanks.

[1].
https://docs.google.com/document/d/1q6xaBxUPFwYsW9aXWxYUh7die6O7rDeAPFQcTAMQ0GM/edit?ts=601316b0#

On Fri, Jan 29, 2021 at 10:16 AM 李响  wrote:

> +1, my colleagues and I are at UTC+8
>
> On Fri, Jan 29, 2021 at 9:50 AM OpenInx  wrote:
>
>> +1,  my time zone is CST.
>>
>> On Fri, Jan 29, 2021 at 6:57 AM Xinli shang 
>> wrote:
>>
>>> I had some earlier discussion with Miao on this. I am still
>>> interested in it. My time zone is PST.
>>>
>>> On Thu, Jan 28, 2021 at 2:50 PM Jack Ye  wrote:
>>>
>>>> +1, looking forward to the discussion, please include me and Yan (
>>>> yyany...@gmail.com), also in PST.
>>>> -Jack
>>>>
>>>> On Thu, Jan 28, 2021 at 2:16 PM Russell Spitzer <
>>>> russell.spit...@gmail.com> wrote:
>>>>
>>>>> CST Please :) But I don’t mind waking up early or staying up late as
>>>>> required
>>>>>
>>>>> On Jan 28, 2021, at 4:14 PM, Ryan Blue 
>>>>> wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> The proposal that Miao wrote about secondary indexes has come up a lot
>>>>> lately. I think it would be a good time to have a discussion about the
>>>>> proposal and set some initial goals for what we want to do next. Since
>>>>> there hasn't been much discussion on the dev list, I'll schedule a sync so
>>>>> that everyone has a deadline to read the proposal and be ready with
>>>>> questions. Then we can have a quick summary to start with and a productive
>>>>> discussion.
>>>>>
>>>>> First, who is interested in attending? I think that Miao and I are in
>>>>> the US in PST (UTC-8). I think Paula from IBM in Israel is interested.
>>>>> Anyone else in a time zone that we should try to include? We can always
>>>>> have two discussions if we need to include more zones.
>>>>>
>>>>> Please reply if you're interested so we can get something set up.
>>>>> Thanks!
>>>>>
>>>>> rb
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>>
>>>>>
>>>
>>> --
>>> Xinli Shang
>>>
>>
>
> --
>
>李响 Xiang Li
>
> 手机 cellphone :+86-136-8113-8972
> 邮件 e-mail  :wate...@gmail.com
>


Re: [VOTE] Release Apache Iceberg 0.11.0 RC0

2021-01-25 Thread OpenInx
Hi dev

I'd like to include this patch in release 0.11.0 because it's the
documentation for the new flink features.  I'm sorry that I did not update
the flink documentation in time when the feature code merged, but I think
it's worth merging this documentation PR when we release iceberg 0.11.0; it
helps a lot for users who want to use those new features, such as the
streaming reader, the rewrite data files action, and write distribution to
cluster data.  (I will keep those documents up to date so that this won't
hold up the next release.)

Thanks.


On Tue, Jan 26, 2021 at 10:17 AM Anton Okolnychyi
 wrote:

> +1 (binding)
>
> I did local tests with Spark 3.0.1. I think we should also note the
> support for DELETE FROM and MERGE INTO in Spark is experimental.
>
> Thanks,
> Anton
>
> On 22 Jan 2021, at 15:26, Jack Ye  wrote:
>
> Hi everyone,
>
> I propose the following RC to be released as the official Apache Iceberg
> 0.11.0 release. The RC is also reviewed and signed by Ryan Blue.
>
> The commit id is ad78cc6cf259b7a0c66ab5de6675cc005febd939
>
> This corresponds to the tag: apache-iceberg-0.11.0-rc0
> * https://github.com/apache/iceberg/commits/apache-iceberg-0.11.0-rc0
> * https://github.com/apache/iceberg/tree/apache-iceberg-0.11.0-rc0
>
> The release tarball, signature, and checksums are here:
> * https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.11.0-rc0
>
> You can find the KEYS file here:
> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>
> Convenience binary artifacts are staged in Nexus. The Maven repository URL
> is:
> * https://repository.apache.org/content/repositories/orgapacheiceberg-1015
>
> This release includes the following changes:
>
> *High-level features*
>
>- Core API now supports partition spec and sort order evolution
>- Spark 3 now supports the following SQL extensions:
>   - MERGE INTO
>   - DELETE FROM
>   - ALTER TABLE ... ADD/DROP PARTITION
>   - ALTER TABLE ... WRITE ORDERED BY
>   - invoke stored procedures using CALL
>- Flink now supports streaming reads, CDC writes (experimental), and
>filter pushdown
>- AWS module is added to support better integration with AWS, with AWS
>Glue catalog  support and dedicated S3
>FileIO implementation
>- Nessie module is added to support integration with project Nessie
>
>
> *Important bug fixes*
>
>- #1981 fixes date and timestamp transforms
>- #2091 fixes Parquet vectorized reads when column types are promoted
>- #1962 fixes Parquet vectorized position reader
>- #1991 fixes Avro schema conversions to preserve field docs
>- #1811 makes refreshing Spark cache optional
>- #1798 fixes read failure when encountering duplicate entries of data
>files
>- #1785 fixes invalidation of metadata tables in CachingCatalog
>- #1784 fixes resolving of SparkSession table's metadata tables
>
> *Other notable changes*
>
>- NaN counter is added to format v2 metrics
>- Shared catalog properties are added in core library to standardize
>catalog level configurations
>- Spark and Flink now supports dynamically loading customized
>`Catalog` and `FileIO` implementations
>- Spark now supports loading tables with file paths via HadoopTables
>- Spark 2 now supports loading tables from other catalogs, like Spark 3
>- Spark 3 now supports catalog names in DataFrameReader when using
>Iceberg as a format
>- Hive now supports INSERT INTO, case insensitive query, projection
>pushdown, create DDL with schema and auto type conversion
>- ORC now supports reading tinyint, smallint, char, varchar types
>- Hadoop catalog now supports role-based access of table listing
>
> Please download, verify, and test.
>
> Please vote in the next 72 hours.
>
> [ ] +1 Release this as Apache Iceberg 0.11.0
> [ ] +0
> [ ] -1 Do not release this because...
>
>
>


Re: Welcoming Peter Vary as a new committer!

2021-01-25 Thread OpenInx
Congratulations and welcome Peter !

On Tue, Jan 26, 2021 at 9:41 AM Junjie Chen 
wrote:

> Congratulations!
>
> On Tue, Jan 26, 2021 at 8:26 AM Jun H.  wrote:
>
>> Congratulations
>>
>> On Mon, Jan 25, 2021 at 4:18 PM Yan Yan  wrote:
>> >
>> > Congratulations!
>> >
>> > On Mon, Jan 25, 2021 at 3:03 PM Jun Zhang 
>> wrote:
>> >>
>> >> Congratulations
>> >> On 01/26/2021 04:25, Driesprong, Fokko wrote:
>> >>
>> >> Congratulations!
>> >>
>> >> Op ma 25 jan. 2021 om 21:16 schreef Mass Dosage 
>> >>>
>> >>> Nice one, well done Peter!
>> >>>
>> >>> On Mon, 25 Jan 2021 at 19:46, Daniel Weeks  wrote:
>> 
>>  Congratulations, Peter!
>> 
>>  On Mon, Jan 25, 2021, 11:27 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>> >
>> > Congratulations Peter! Well deserved!
>> >
>> > On Tue, Jan 26, 2021 at 3:40 AM Wing Yew Poon
>>  wrote:
>> >>
>> >> Congratulations Peter!
>> >>
>> >>
>> >> On Mon, Jan 25, 2021 at 10:35 AM Russell Spitzer <
>> russell.spit...@gmail.com> wrote:
>> >>>
>> >>> Congratulations!
>> >>>
>> >>> On Jan 25, 2021, at 12:34 PM, Jacques Nadeau <
>> jacquesnad...@gmail.com> wrote:
>> >>>
>> >>> Congrats Peter! Thanks for all your great work
>> >>>
>> >>> On Mon, Jan 25, 2021 at 10:24 AM Ryan Blue 
>> wrote:
>> 
>>  Hi everyone,
>> 
>>  I'd like to welcome Peter Vary as a new Iceberg committer.
>> 
>>  Thanks for all your contributions, Peter!
>> 
>>  rb
>> 
>>  --
>>  Ryan Blue
>> >>>
>> >>>
>>
>
>
> --
> Best Regards
>


Re: test flakiness with SocketException of broken pipe in HiveMetaStoreClient

2021-01-08 Thread OpenInx
OK, there's a try-with-resources block that closes the TableLoader in
FlinkInputFormat [1], so we don't need the extra try-with-resources
in PR 2051 (I will close that).

On my host, I could not reproduce your connection leak issue when
running TestFlinkInputFormatReaderDeletes.  Do you have any extra usage
of the table loader in your FLIP-27 dev branch that forgets to close it?

[1].
https://github.com/apache/iceberg/blob/7645ceba65044184be192a7194a38729133b2e50/flink/src/main/java/org/apache/iceberg/flink/source/FlinkInputFormat.java#L77
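
For reference, the pattern in [1] looks roughly like the sketch below (my
paraphrase, not the exact FlinkInputFormat code; `tableLoader` is an
already-built org.apache.iceberg.flink.TableLoader):

// TableLoader is Closeable, so try-with-resources releases the underlying
// hive connection as soon as the table has been loaded.
try (TableLoader loader = tableLoader) {
  loader.open();
  Table table = loader.loadTable();
  // ... plan splits / read rows from `table` ...
}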

On Fri, Jan 8, 2021 at 3:36 PM OpenInx  wrote:

> > I was able to almost 100% reproduce the HiveMetaStoreClient aborted
> connection problem locally with Flink tests after adding
> another DeleteReadTests for the new FLIP-27 source impl in my dev branch
>
> I think I found the cause of why it fails so easily.   The
> TestFlinkInputFormatReaderDeletes will create a new CatalogLoader [1] for
> loading the table inside the FlinkInputFormat.
>
> TestHelpers.readRowData(inputFormat, rowType).forEach(rowData -> {
>   RowDataWrapper wrapper = new RowDataWrapper(rowType,
> projected.asStruct());
>   set.add(wrapper.wrap(rowData));
> });
>
> When TestHelpers#readRowData runs, it will open a new catalog (that means
> opening a new hive connection). But after we finish the read processing,
> we do not close the TableLoader, which leaks the catalog connection. I
> opened a PR [2] to fix this issue; will it work in your branch?
>
> I think it's worth keeping those hive catalog unit tests so that we could
> detect those connection leak issues in time.
>
> [1].
> https://github.com/apache/iceberg/blob/4436c92928f4b3b90839a26bf6a656902733261f/flink/src/test/java/org/apache/iceberg/flink/source/TestFlinkInputFormatReaderDeletes.java#L114
> [2]. https://github.com/apache/iceberg/pull/2051/files
>
> On Fri, Jan 8, 2021 at 5:48 AM Steven Wu  wrote:
>
>> Ryan/OpenInx, thanks a lot for the pointers.
>>
>> I was able to almost 100% reproduce the HiveMetaStoreClient aborted
>> connection problem locally with Flink tests after adding
>> another DeleteReadTests for the new FLIP-27 source impl in my dev branch. I
>> don't see the problem anymore after switching the Flink DeleteReadTests
>> from the HiveCatalog (requiring expensive TestHiveMetastore) to
>> HadoopCatalog.
>>
>> There is still a base test class FlinkTestBase using the HiveCatalog. I
>> am wondering if there is value in using the more expensive HiveCatalog
>> rather than the HadoopCatalog?
>>
>> On Wed, Jan 6, 2021 at 6:22 PM OpenInx  wrote:
>>
>>> I encountered a similar issue when supporting hive-site.xml for flink
>>> hive catalog.  Here is the discussion and solution before:
>>> https://github.com/apache/iceberg/pull/1586#discussion_r509453461
>>>
>>> It's a connection leak issue.
>>>
>>>
>>> On Thu, Jan 7, 2021 at 10:06 AM Ryan Blue 
>>> wrote:
>>>
>>>> I've noticed this too. I haven't had a chance to track down what's
>>>> causing it yet. I've seen it in Spark tests, so it looks like there may be
>>>> a problem that affects both. Probably a connection leak in the common code.
>>>>
>>>> On Wed, Jan 6, 2021 at 3:44 PM Steven Wu  wrote:
>>>>
>>>>> I have noticed some flakiness with Flink and Spark tests both locally
>>>>> and in CI checks. @zhangjun0x01 also reported the same problem with
>>>>> iceberg-spark3-extensions.  Below is a full stack trace from a local
>>>>> run for Flink tests.
>>>>>
>>>>> The flakiness might be recent regression, as the tests were stable for
>>>>> me until recently. Any recent hive dep change? Anyone have any ideas?
>>>>>
>>>>> org.apache.iceberg.flink.source.TestIcebergSourceReaderDeletes >
>>>>> testMixedPositionAndEqualityDeletes[fileFormat=ORC] FAILED
>>>>>
>>>>> java.lang.RuntimeException: Failed to get table info from
>>>>> metastore default.test
>>>>>
>>>>> at
>>>>> org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:142)
>>>>>
>>>>> at
>>>>> org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:86)
>>>>>
>>>>> at
>>>>> org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:69)
>>>>>
>>>>> at
>>>>> org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:92)

Re: test flakiness with SocketException of broken pipe in HiveMetaStoreClient

2021-01-07 Thread OpenInx
> I was able to almost 100% reproduce the HiveMetaStoreClient aborted
connection problem locally with Flink tests after adding
another DeleteReadTests for the new FLIP-27 source impl in my dev branch

I think I found the cause of why it fails so easily.   The
TestFlinkInputFormatReaderDeletes will create a new CatalogLoader [1] for
loading the table inside the FlinkInputFormat.

TestHelpers.readRowData(inputFormat, rowType).forEach(rowData -> {
  RowDataWrapper wrapper = new RowDataWrapper(rowType,
projected.asStruct());
  set.add(wrapper.wrap(rowData));
});

When TestHelpers#readRowData runs, it will open a new catalog (that means
opening a new hive connection). But after we finish the read processing,
we do not close the TableLoader, which leaks the catalog connection. I
opened a PR [2] to fix this issue; will it work in your branch?

I think it's worth keeping those hive catalog unit tests so that we could
detect those connection leak issues in time.

[1].
https://github.com/apache/iceberg/blob/4436c92928f4b3b90839a26bf6a656902733261f/flink/src/test/java/org/apache/iceberg/flink/source/TestFlinkInputFormatReaderDeletes.java#L114
[2]. https://github.com/apache/iceberg/pull/2051/files

On Fri, Jan 8, 2021 at 5:48 AM Steven Wu  wrote:

> Ryan/OpenInx, thanks a lot for the pointers.
>
> I was able to almost 100% reproduce the HiveMetaStoreClient aborted
> connection problem locally with Flink tests after adding
> another DeleteReadTests for the new FLIP-27 source impl in my dev branch. I
> don't see the problem anymore after switching the Flink DeleteReadTests
> from the HiveCatalog (requiring expensive TestHiveMetastore) to
> HadoopCatalog.
>
> There is still a base test class FlinkTestBase using the HiveCatalog. I am
> wondering if there is value in using the more expensive HiveCatalog rather
> than the HadoopCatalog?
>
> On Wed, Jan 6, 2021 at 6:22 PM OpenInx  wrote:
>
>> I encountered a similar issue when supporting hive-site.xml for flink
>> hive catalog.  Here is the discussion and solution before:
>> https://github.com/apache/iceberg/pull/1586#discussion_r509453461
>>
>> It's a connection leak issue.
>>
>>
>> On Thu, Jan 7, 2021 at 10:06 AM Ryan Blue 
>> wrote:
>>
>>> I've noticed this too. I haven't had a chance to track down what's
>>> causing it yet. I've seen it in Spark tests, so it looks like there may be
>>> a problem that affects both. Probably a connection leak in the common code.
>>>
>>> On Wed, Jan 6, 2021 at 3:44 PM Steven Wu  wrote:
>>>
>>>> I have noticed some flakiness with Flink and Spark tests both locally
>>>> and in CI checks. @zhangjun0x01 also reported the same problem with
>>>> iceberg-spark3-extensions.  Below is a full stack trace from a local
>>>> run for Flink tests.
>>>>
>>>> The flakiness might be recent regression, as the tests were stable for
>>>> me until recently. Any recent hive dep change? Anyone have any ideas?
>>>>
>>>> org.apache.iceberg.flink.source.TestIcebergSourceReaderDeletes >
>>>> testMixedPositionAndEqualityDeletes[fileFormat=ORC] FAILED
>>>>
>>>> java.lang.RuntimeException: Failed to get table info from
>>>> metastore default.test
>>>>
>>>> at
>>>> org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:142)
>>>>
>>>> at
>>>> org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:86)
>>>>
>>>> at
>>>> org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:69)
>>>>
>>>> at
>>>> org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:92)
>>>>
>>>> at
>>>> org.apache.iceberg.flink.TableLoader$CatalogTableLoader.loadTable(TableLoader.java:113)
>>>>
>>>> at
>>>> org.apache.iceberg.flink.source.TestIcebergSourceReaderDeletes.rowSet(TestIcebergSourceReaderDeletes.java:90)
>>>>
>>>>
>>>> Caused by:
>>>>
>>>> org.apache.thrift.transport.TTransportException:
>>>> java.net.SocketException: Broken pipe (Write failed)
>>>>
>>>> at
>>>> org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:161)
>>>>
>>>> at
>>>> org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:73)
>>>>
>>>

Re: test flakiness with SocketException of broken pipe in HiveMetaStoreClient

2021-01-06 Thread OpenInx
I encountered a similar issue when adding hive-site.xml support for the flink
hive catalog.  Here is the earlier discussion and solution:
https://github.com/apache/iceberg/pull/1586#discussion_r509453461

It's a connection leak issue.


On Thu, Jan 7, 2021 at 10:06 AM Ryan Blue  wrote:

> I've noticed this too. I haven't had a chance to track down what's causing
> it yet. I've seen it in Spark tests, so it looks like there may be a
> problem that affects both. Probably a connection leak in the common code.
>
> On Wed, Jan 6, 2021 at 3:44 PM Steven Wu  wrote:
>
>> I have noticed some flakiness with Flink and Spark tests both locally and
>> in CI checks. @zhangjun0x01 also reported the same problem with
>> iceberg-spark3-extensions.  Below is a full stack trace from a local run
>> for Flink tests.
>>
>> The flakiness might be recent regression, as the tests were stable for me
>> until recently. Any recent hive dep change? Anyone have any ideas?
>>
>> org.apache.iceberg.flink.source.TestIcebergSourceReaderDeletes >
>> testMixedPositionAndEqualityDeletes[fileFormat=ORC] FAILED
>>
>> java.lang.RuntimeException: Failed to get table info from metastore
>> default.test
>>
>> at
>> org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:142)
>>
>> at
>> org.apache.iceberg.BaseMetastoreTableOperations.refresh(BaseMetastoreTableOperations.java:86)
>>
>> at
>> org.apache.iceberg.BaseMetastoreTableOperations.current(BaseMetastoreTableOperations.java:69)
>>
>> at
>> org.apache.iceberg.BaseMetastoreCatalog.loadTable(BaseMetastoreCatalog.java:92)
>>
>> at
>> org.apache.iceberg.flink.TableLoader$CatalogTableLoader.loadTable(TableLoader.java:113)
>>
>> at
>> org.apache.iceberg.flink.source.TestIcebergSourceReaderDeletes.rowSet(TestIcebergSourceReaderDeletes.java:90)
>>
>>
>> Caused by:
>>
>> org.apache.thrift.transport.TTransportException:
>> java.net.SocketException: Broken pipe (Write failed)
>>
>> at
>> org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:161)
>>
>> at
>> org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:73)
>>
>> at
>> org.apache.thrift.TServiceClient.sendBase(TServiceClient.java:62)
>>
>> at
>> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.send_get_table_req(ThriftHiveMetastore.java:1561)
>>
>> at
>> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table_req(ThriftHiveMetastore.java:1553)
>>
>> at
>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1350)
>>
>> at
>> org.apache.iceberg.hive.HiveTableOperations.lambda$doRefresh$0(HiveTableOperations.java:130)
>>
>> at org.apache.iceberg.hive.ClientPool.run(ClientPool.java:65)
>>
>> at
>> org.apache.iceberg.hive.HiveTableOperations.doRefresh(HiveTableOperations.java:130)
>>
>> ... 5 more
>>
>>
>> Caused by:
>>
>> java.net.SocketException: Broken pipe (Write failed)
>>
>> at java.net.SocketOutputStream.socketWrite0(Native
>> Method)
>>
>> at
>> java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
>>
>> at
>> java.net.SocketOutputStream.write(SocketOutputStream.java:155)
>>
>> at
>> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>>
>> at
>> java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>>
>> at
>> org.apache.thrift.transport.TIOStreamTransport.flush(TIOStreamTransport.java:159)
>>
>> ... 13 more
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: how to generate a new .v1.metadata.json.crc for v1.metadata.json

2020-12-27 Thread OpenInx
Did you edit the v1.metadata.json to enable iceberg format v2?  That's not
the correct way to use iceberg format v2.   Let's discuss this issue in the
latest email.

On Sat, Dec 26, 2020 at 7:01 PM 1  wrote:

> Hi, all:
>
>I edited the v1.metadata.json with vim, so the old .v1.metadata.json.crc
> no longer matches. How can I generate a new .v1.metadata.json.crc for the new
> v1.metadata.json
> ?
>
> Thx
> liubo07199
> liubo07...@hellobike.com
>
> 
>
>


Re: how to test row level delete

2020-12-27 Thread OpenInx
> you can apply this patch in your own repository

The patch is : https://github.com/apache/iceberg/pull/1978

On Mon, Dec 28, 2020 at 10:32 AM OpenInx  wrote:

> Hi liubo07199
>
> Thanks for testing the iceberg row-level delete.  I skimmed the code; it
> seems you were trying the equality-delete feature.  For iceberg users, I
> think we shouldn't have to write this kind of iceberg-internal code to get
> this working; it isn't friendly for users.  Instead, we usually use the
> equality-delete feature (CDC event ingestion or flink aggregation upsert
> streams) through the compute-engine integration. Currently, we've
> supported the flink cdc-events integration (the Flink DataStream integration
> has been merged [1], while the Flink SQL integration depends on when
> we are ready to expose iceberg format v2 [2]).
>
> For when format v2 will be exposed to users, you may want to read
> this mail [3].
>
> If you just want a basic test of writing CDC via flink, you can
> apply this patch in your own repository and then create an iceberg table
> with an extra property, like the following:
>
> public static Table createTable(String path, Map<String, String> properties,
> boolean partitioned) {
>   PartitionSpec spec;
>   if (partitioned) {
> spec = PartitionSpec.builderFor(SCHEMA).identity("data").build();
>   } else {
> spec = PartitionSpec.unpartitioned();
>   }
>   properties.put(TableProperties.FORMAT_VERSION, "2");
>   return new HadoopTables().create(SCHEMA, spec, properties, path);
> }
>
> Then use the Flink DataStream API or Flink SQL to write the CDC events
> into an Apache Iceberg table.  For a DataStream job sinking CDC events, I
> suggest using an approach similar to the one here [4].
>
> I'd like to help if you have further feedback.
>
> Thanks.
>
> [1]. https://github.com/apache/iceberg/pull/1974
> [2]. https://github.com/apache/iceberg/pull/1978
> [3].
> https://mail-archives.apache.org/mod_mbox/iceberg-dev/202012.mbox/%3CCACc8XkGt%2B5kxr-XRMgY1eUKjd70mej38KFbhDuV2MH3AVMON2g%40mail.gmail.com%3E
> [4].
> https://github.com/apache/iceberg/pull/1974/files#diff-13e2e5b52d0effe51e1b470df77cb08b5ec8cc4f3a7f0fd4e51ee212fc83f76aR143
>
> On Sat, Dec 26, 2020 at 7:14 PM 1  wrote:
>
>> Hi, all:
>>
>> I want to try row-level delete, but I get the exception: 
>> IllegalArgumentException:
>> Cannot write delete files in a v1 table.
>> I looked over https://iceberg.apache.org/spec/#table-metadata for
>> format-version; it says: "An integer version number for the format.
>> Currently, this is always 1. Implementations must throw an exception if a
>> table’s version is higher than the supported version."
>>   So what can I do to try row-level delete?  How do I create a
>> v2 table?
>>
>> thx
>>
>> Code is :
>>
>> private static void deleteRead() throws IOException {
>> Schema deleteRowSchema = table.schema().select("id");
>> Record dataDelete = GenericRecord.create(deleteRowSchema);
>> List<Record> dataDeletes = Lists.newArrayList(
>> dataDelete.copy("id", 11), // id = 29
>> dataDelete.copy("id", 12), // id = 89
>> dataDelete.copy("id", 13) // id = 122
>> );
>>
>> DeleteFile eqDeletes = writeDeleteFile(table, 
>> Files.localOutput(tmpFile), dataDeletes, deleteRowSchema);
>>
>> table.newRowDelta()
>> .addDeletes(eqDeletes)
>> .commit();
>> }
>>
>> private static DeleteFile writeDeleteFile(Table table, OutputFile out,
>>  List<Record> deletes, Schema 
>> deleteRowSchema) throws IOException {
>> EqualityDeleteWriter<Record> writer = Parquet.writeDeletes(out)
>> .forTable(table)
>> .withPartition(Row.of("20201221"))
>> .rowSchema(deleteRowSchema)
>> .createWriterFunc(GenericParquetWriter::buildWriter)
>> .overwrite()
>> 
>> .equalityFieldIds(deleteRowSchema.columns().stream().mapToInt(Types.NestedField::fieldId).toArray())
>> .buildEqualityWriter();
>>
>> try (Closeable toClose = writer) {
>> writer.deleteAll(deletes);
>> }
>>
>> return writer.toDeleteFile();
>> }
>>
>> liubo07199
>> liubo07...@hellobike.com
>>
>>
>


Re: how to test row level delete

2020-12-27 Thread OpenInx
Hi liubo07199

Thanks for testing the iceberg row-level delete.  I skimmed the code; it
seems you were trying the equality-delete feature.  For iceberg users, I
think we shouldn't have to write this kind of iceberg-internal code to get
this working; it isn't friendly for users.  Instead, we usually use the
equality-delete feature (CDC event ingestion or flink aggregation upsert
streams) through the compute-engine integration. Currently, we've
supported the flink cdc-events integration (the Flink DataStream integration
has been merged [1], while the Flink SQL integration depends on when
we are ready to expose iceberg format v2 [2]).

For when format v2 will be exposed to users, you may want to read
this mail [3].

If you just want a basic test of writing CDC via flink, you can
apply this patch in your own repository and then create an iceberg table
with an extra property, like the following:

public static Table createTable(String path, Map<String, String>
properties, boolean partitioned) {
  PartitionSpec spec;
  if (partitioned) {
spec = PartitionSpec.builderFor(SCHEMA).identity("data").build();
  } else {
spec = PartitionSpec.unpartitioned();
  }
  properties.put(TableProperties.FORMAT_VERSION, "2");
  return new HadoopTables().create(SCHEMA, spec, properties, path);
}

Then use the Flink DataStream API or Flink SQL to write the CDC events
into an Apache Iceberg table.  For a DataStream job sinking CDC events, I
suggest using an approach similar to the one here [4].
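
For illustration, a DataStream job along those lines might look like the
sketch below.  This is my own example, not code from the PR: the table path
and row values are hypothetical, the schema is assumed to be (id int, data
string) as in SCHEMA above, and the builder method names follow the current
iceberg-flink API, so they may differ slightly from the snapshot in [1]/[4].

import java.util.Collections;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.StringData;
import org.apache.flink.types.RowKind;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.sink.FlinkSink;

public class CdcToIcebergExample {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Two hypothetical CDC events: an insert followed by a delete of the same key.
    RowData insert = GenericRowData.ofKind(RowKind.INSERT, 1, StringData.fromString("aaa"));
    RowData delete = GenericRowData.ofKind(RowKind.DELETE, 1, StringData.fromString("aaa"));
    DataStream<RowData> cdcStream = env.fromElements(insert, delete);

    // Points at the v2 table created by createTable(...) above (path is hypothetical).
    TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/t1");

    FlinkSink.forRowData(cdcStream)
        .tableLoader(tableLoader)
        .equalityFieldColumns(Collections.singletonList("id")) // key used for equality deletes
        .append();

    env.execute("cdc-to-iceberg");
  }
}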

I'd like to help if you have further feedback.

Thanks.

[1]. https://github.com/apache/iceberg/pull/1974
[2]. https://github.com/apache/iceberg/pull/1978
[3].
https://mail-archives.apache.org/mod_mbox/iceberg-dev/202012.mbox/%3CCACc8XkGt%2B5kxr-XRMgY1eUKjd70mej38KFbhDuV2MH3AVMON2g%40mail.gmail.com%3E
[4].
https://github.com/apache/iceberg/pull/1974/files#diff-13e2e5b52d0effe51e1b470df77cb08b5ec8cc4f3a7f0fd4e51ee212fc83f76aR143

On Sat, Dec 26, 2020 at 7:14 PM 1  wrote:

> Hi, all:
>
> I want to try row-level delete, but I get the exception: 
> IllegalArgumentException:
> Cannot write delete files in a v1 table.
> I looked over https://iceberg.apache.org/spec/#table-metadata for
> format-version; it says: "An integer version number for the format.
> Currently, this is always 1. Implementations must throw an exception if a
> table’s version is higher than the supported version."
>   So what can I do to try row-level delete?  How do I create a v2
> table?
>
> thx
>
> Code is :
>
> private static void deleteRead() throws IOException {
> Schema deleteRowSchema = table.schema().select("id");
> Record dataDelete = GenericRecord.create(deleteRowSchema);
> List<Record> dataDeletes = Lists.newArrayList(
> dataDelete.copy("id", 11), // id = 29
> dataDelete.copy("id", 12), // id = 89
> dataDelete.copy("id", 13) // id = 122
> );
>
> DeleteFile eqDeletes = writeDeleteFile(table, Files.localOutput(tmpFile), 
> dataDeletes, deleteRowSchema);
>
> table.newRowDelta()
> .addDeletes(eqDeletes)
> .commit();
> }
>
> private static DeleteFile writeDeleteFile(Table table, OutputFile out,
>  List<Record> deletes, Schema 
> deleteRowSchema) throws IOException {
> EqualityDeleteWriter<Record> writer = Parquet.writeDeletes(out)
> .forTable(table)
> .withPartition(Row.of("20201221"))
> .rowSchema(deleteRowSchema)
> .createWriterFunc(GenericParquetWriter::buildWriter)
> .overwrite()
> 
> .equalityFieldIds(deleteRowSchema.columns().stream().mapToInt(Types.NestedField::fieldId).toArray())
> .buildEqualityWriter();
>
> try (Closeable toClose = writer) {
> writer.deleteAll(deletes);
> }
>
> return writer.toDeleteFile();
> }
>
> liubo07199
> liubo07...@hellobike.com
>
> 
>


Re: What's the time to expose iceberg format v2 to end users ?

2020-12-18 Thread OpenInx
Thanks Yan for the document,  I will take a look at it, and see what I can
do.

On Fri, Dec 18, 2020 at 3:38 AM Yan Yan  wrote:

> Hi OpenInx,
>
> Thanks for bringing this up. I am currently working on Format v2 blocking
> tasks, and am maintaining a full list of blocking tasks with their
> description and current status here
> <https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/edit?usp=sharing>
>  after
> speaking with Ryan a while ago, which covers all open issues listed in the
> github milestone <https://github.com/apache/iceberg/milestone/7> plus
> some others brought up by people during community sync. It would be great
> if you are interested in collaborating/code reviewing!
>
> Everyone please feel free to let me know/update the doc if you see any
> item missing/described inaccurately.
>
> Thanks,
> Yan
>
> On Wed, Dec 16, 2020 at 11:03 PM OpenInx  wrote:
>
>> Hi
>>
>> I wrote this email to align with the community about the time to expose
>> format v2 to end users.
>>
>> In iceberg format v2, we've accomplished the row-level delete.  It's
>> designed for two use cases:
>>
>> 1.  Execute a single query to update or delete lots of rows.  It's a
>> typical batch update/delete job, suitable for GDPR or for correcting
>> wrong data.
>> 2.  Write a real-time CDC/UPSERT stream to the iceberg table, so that
>> upper-layer compute engines can analyze the change log within minutes.
>> It's almost ready in the current master branch for the flink integration.
>>
>>
>> I'm not quite sure what the blockers for iceberg format v2 are now.
>> I'd love to help resolve them if there are any.
>>
>> Thanks.
>>
>


What's the time to expose iceberg format v2 to end users ?

2020-12-16 Thread OpenInx
Hi

I wrote this email to align with the community about the time to expose
format v2 to end users.

In iceberg format v2, we've accomplished the row-level delete.  It's
designed for two use cases:

1.  Execute a single query to update or delete lots of rows.  It's a
typical batch update/delete job, suitable for GDPR or for correcting wrong
data (see the sketch after this list).
2.  Write a real-time CDC/UPSERT stream to the iceberg table, so that
upper-layer compute engines can analyze the change log within minutes.
It's almost ready in the current master branch for the flink integration.
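
As a concrete illustration of use case 1, the sketch below is my own example,
not from the original mail: the table name is hypothetical, and it assumes a
Spark session with the Iceberg SQL extensions enabled so that DELETE FROM is
available.

import org.apache.spark.sql.SparkSession;

public class GdprDeleteExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("gdpr-delete")
        .getOrCreate();

    // A single batch statement removes every row for one user, e.g. to serve
    // a GDPR erasure request; on a format v2 table this can be served with
    // row-level delete files instead of rewriting whole data files.
    spark.sql("DELETE FROM db.users WHERE user_id = 'user-123'");
  }
}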


I'm not quite sure what the blockers for iceberg format v2 are now.  I'd
love to help resolve them if there are any.

Thanks.


Re: [VOTE] Release Apache Iceberg 0.10.0 RC4

2020-11-03 Thread OpenInx
+1 for 0.10.0 RC4

1. Download the source tarball, signature (.asc), and checksum (.sha512):
 OK
2. Import gpg keys: download KEYS and run gpg --import
/path/to/downloaded/KEYS (optional if this hasn’t changed) :  OK
3. Verify the signature by running: gpg --verify
apache-iceberg-xx.tar.gz.asc:  OK
4. Verify the checksum by running: sha512sum -c
apache-iceberg-xx.tar.gz.sha512 :  OK
5. Untar the archive and go into the source directory: tar xzf
apache-iceberg-xx.tar.gz && cd apache-iceberg-xx:  OK
6. Run RAT checks to validate license headers: dev/check-license: OK
7. Build and test the project: ./gradlew build (use Java 8) :   OK

On Wed, Nov 4, 2020 at 8:25 AM Anton Okolnychyi
 wrote:

> Hi everyone,
>
> I propose the following RC to be released as official Apache Iceberg
> 0.10.0 release.
>
> The commit id is d39fad00b7dded98121368309f381473ec21e85f
> * This corresponds to the tag: apache-iceberg-0.10.0-rc4
> * https://github.com/apache/iceberg/commits/apache-iceberg-0.10.0-rc4
> *
> https://github.com/apache/iceberg/tree/d39fad00b7dded98121368309f381473ec21e85f
>
> The release tarball, signature, and checksums are here:
> *
> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.10.0-rc4/
>
> You can find the KEYS file here (make sure to import the new key that was
> used to sign the release):
> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>
> Convenience binary artifacts are staged in Nexus. The Maven repository URL
> is:
> * https://repository.apache.org/content/repositories/orgapacheiceberg-1012
>
> This release includes important changes:
>
> * Flink support
> * Hive read support
> * ORC support fixes and improvements
> * Application of row-level delete files on read
> * Snapshot partition summary
> * Ability to load LocationProvider dynamically
> * Sort spec
>
> Please download, verify, and test.
>
> Please vote in the next 72 hours.
>
> [ ] +1 Release this as Apache Iceberg 0.10.0
> [ ] +0
> [ ] -1 Do not release this because…
>
> Thanks,
> Anton
>


Re: [VOTE] Release Apache Iceberg 0.10.0 RC2

2020-11-03 Thread OpenInx
Hi Ryan

I raised this question because I wanted to make sure whether it's a bug in the
current code base; if so, we would need to fix it before the next
0.10.0 RC.

Now that we've confirmed it's NOT a bug, let's continue with the
release RC voting.

Thanks for the attention.

On Wed, Nov 4, 2020 at 1:31 AM Ryan Blue  wrote:

> OpenInx, is that a general question or is it related to the release? It
> doesn't look related, but I want to make sure.
>
> On Tue, Nov 3, 2020 at 5:41 AM OpenInx  wrote:
>
>> Hi
>>
>> I will suggest taking a look at  this discussion:
>> https://github.com/apache/iceberg/pull/1391#discussion_r516518978.
>>
>> It sounds like a bug.
>>
>> Thanks.
>>
>>
>> On Tue, Nov 3, 2020 at 12:08 PM Anton Okolnychyi
>>  wrote:
>>
>>> -1 due to the ORC correctness bug. I’ll build a new RC.
>>>
>>> Thanks everyone for checking the candidate!
>>>
>>> - Anton
>>>
>>> On 2 Nov 2020, at 19:01, Ryan Blue  wrote:
>>>
>>> I think we are going to build another RC to get the ORC correctness
>>> problem that Shardul just fixed in.
>>>
>>> On Mon, Nov 2, 2020 at 6:40 PM Jingsong Li 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> 1. Download the source tarball, signature (.asc), and checksum
>>>> (.sha512):   OK
>>>> 2. Import gpg keys: download KEYS and run gpg --import
>>>> /path/to/downloaded/KEYS (optional if this hasn’t changed) :  OK
>>>> 3. Verify the signature by running: gpg --verify
>>>> apache-iceberg-xx-incubating.tar.gz.asc:  OK
>>>> 4. Verify the checksum by running: sha512sum -c
>>>> apache-iceberg-xx-incubating.tar.gz.sha512 :  OK
>>>> 5. Untar the archive and go into the source directory: tar xzf
>>>> apache-iceberg-xx-incubating.tar.gz && cd apache-iceberg-xx-incubating:  OK
>>>> 6. Run RAT checks to validate license headers: dev/check-license: OK
>>>> 7. Build and test the project: ./gradlew build (use Java 8) :   OK
>>>>
>>>> Best,
>>>> Jingsong
>>>>
>>>> On Tue, Nov 3, 2020 at 7:10 AM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> Anton, thanks for the interest. It's across modules, and looks to be
>>>>> consistent. Looks like others have no problem with UT runs, probably need
>>>>> to have another mail thread or Github issue for this.
>>>>>
>>>>> On Tue, Nov 3, 2020 at 1:31 AM Anton Okolnychyi 
>>>>> wrote:
>>>>>
>>>>>> Jungtaek, do you hit this issue only in a specific suite/module? Is
>>>>>> it consistent?
>>>>>>
>>>>>> I remember we had similar issues for our Spark tests. If I am not
>>>>>> mistaken, the underlying problem was related to the number of connections
>>>>>> we were making to HMS and the pool size that was configured in our test 
>>>>>> HMS.
>>>>>>
>>>>>> - Anton
>>>>>>
>>>>>> On 2 Nov 2020, at 07:02, Ryan Murray  wrote:
>>>>>>
>>>>>> +1 (non-binding)
>>>>>>
>>>>>> Ran through steps 1-7, completed successfully. Also tested locally
>>>>>> against nessie and all looked good.
>>>>>>
>>>>>> Best,
>>>>>> Ryan
>>>>>>
>>>>>> On Mon, Nov 2, 2020 at 2:28 PM Mass Dosage 
>>>>>> wrote:
>>>>>>
>>>>>>> +1 (non-binding)
>>>>>>>
>>>>>>> I ran the RC against a set of integration tests I have for a subset
>>>>>>> of the Hive2 read functionality on a distributed cluster and it worked 
>>>>>>> fine.
>>>>>>>
>>>>>>> On Mon, 2 Nov 2020 at 04:05, Simon Su  wrote:
>>>>>>>
>>>>>>>> + 1 (non-binding)
>>>>>>>> 1. Build code pass all UTs.
>>>>>>>> 2. Test Flink iceberg sink failover, test exactly-once.
>>>>>>>>
>>>>>>>> Junjie Chen  wrote on Mon, Nov 2, 2020 at 11:35 AM:
>>>>>>>>
>>>>>>>>> + 1 (non-binding)
>>>>>>>>>
>>>>>>>>> I ran step 1-7 on my cloud virtual machine (centos 7, java
>>>>>>>>> 1.8.0_171), all passed.

Re: [VOTE] Release Apache Iceberg 0.10.0 RC2

2020-11-03 Thread OpenInx
Hi

I suggest taking a look at this discussion:
https://github.com/apache/iceberg/pull/1391#discussion_r516518978.

It sounds like a bug.

Thanks.


On Tue, Nov 3, 2020 at 12:08 PM Anton Okolnychyi
 wrote:

> -1 due to the ORC correctness bug. I’ll build a new RC.
>
> Thanks everyone for checking the candidate!
>
> - Anton
>
> On 2 Nov 2020, at 19:01, Ryan Blue  wrote:
>
> I think we are going to build another RC to get the ORC correctness
> problem that Shardul just fixed in.
>
> On Mon, Nov 2, 2020 at 6:40 PM Jingsong Li  wrote:
>
>> +1
>>
>> 1. Download the source tarball, signature (.asc), and checksum
>> (.sha512):   OK
>> 2. Import gpg keys: download KEYS and run gpg --import
>> /path/to/downloaded/KEYS (optional if this hasn’t changed) :  OK
>> 3. Verify the signature by running: gpg --verify
>> apache-iceberg-xx-incubating.tar.gz.asc:  OK
>> 4. Verify the checksum by running: sha512sum -c
>> apache-iceberg-xx-incubating.tar.gz.sha512 :  OK
>> 5. Untar the archive and go into the source directory: tar xzf
>> apache-iceberg-xx-incubating.tar.gz && cd apache-iceberg-xx-incubating:  OK
>> 6. Run RAT checks to validate license headers: dev/check-license: OK
>> 7. Build and test the project: ./gradlew build (use Java 8) :   OK
>>
>> Best,
>> Jingsong
>>
>> On Tue, Nov 3, 2020 at 7:10 AM Jungtaek Lim 
>> wrote:
>>
>>> Anton, thanks for the interest. It's across modules, and looks to be
>>> consistent. Looks like others have no problem with UT runs, probably need
>>> to have another mail thread or Github issue for this.
>>>
>>> On Tue, Nov 3, 2020 at 1:31 AM Anton Okolnychyi 
>>> wrote:
>>>
>>>> Jungtaek, do you hit this issue only in a specific suite/module? Is it
>>>> consistent?
>>>>
>>>> I remember we had similar issues for our Spark tests. If I am not
>>>> mistaken, the underlying problem was related to the number of connections
>>>> we were making to HMS and the pool size that was configured in our test 
>>>> HMS.
>>>>
>>>> - Anton
>>>>
>>>> On 2 Nov 2020, at 07:02, Ryan Murray  wrote:
>>>>
>>>> +1 (non-binding)
>>>>
>>>> Ran through steps 1-7, completed successfully. Also tested locally
>>>> against nessie and all looked good.
>>>>
>>>> Best,
>>>> Ryan
>>>>
>>>> On Mon, Nov 2, 2020 at 2:28 PM Mass Dosage 
>>>> wrote:
>>>>
>>>>> +1 (non-binding)
>>>>>
>>>>> I ran the RC against a set of integration tests I have for a subset of
>>>>> the Hive2 read functionality on a distributed cluster and it worked fine.
>>>>>
>>>>> On Mon, 2 Nov 2020 at 04:05, Simon Su  wrote:
>>>>>
>>>>>> + 1 (non-binding)
>>>>>> 1. Build code pass all UTs.
>>>>>> 2. Test Flink iceberg sink failover, test exactly-once.
>>>>>>
>>>>>> Junjie Chen  wrote on Mon, Nov 2, 2020 at 11:35 AM:
>>>>>>
>>>>>>> + 1 (non-binding)
>>>>>>>
>>>>>>> I ran step 1-7 on my cloud virtual machine (centos 7, java
>>>>>>> 1.8.0_171), all passed.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Nov 2, 2020 at 10:36 AM OpenInx  wrote:
>>>>>>>
>>>>>>>> +1 for 0.10.0 RC2
>>>>>>>>
>>>>>>>> 1. Download the source tarball, signature (.asc), and checksum
>>>>>>>> (.sha512):   OK
>>>>>>>> 2. Import gpg keys: download KEYS and run gpg --import
>>>>>>>> /path/to/downloaded/KEYS (optional if this hasn’t changed) :  OK
>>>>>>>> 3. Verify the signature by running: gpg --verify
>>>>>>>> apache-iceberg-xx-incubating.tar.gz.asc:  OK
>>>>>>>> 4. Verify the checksum by running: sha512sum -c
>>>>>>>> apache-iceberg-xx-incubating.tar.gz.sha512 :  OK
>>>>>>>> 5. Untar the archive and go into the source directory: tar xzf
>>>>>>>> apache-iceberg-xx-incubating.tar.gz && cd 
>>>>>>>> apache-iceberg-xx-incubating:  OK
>>>>>>>> 6. Run RAT checks to validate license headers: dev/check-license: OK
>>>>>>>> 7. Build and test the project: ./gradlew build (use Java 8) :   OK

Re: Plans for the future iceberg 0.11.0 release

2020-11-01 Thread OpenInx
Thanks for your context about FLIP-27, Steven !

I will take a look at the patches under issue 1626.

On Sat, Oct 31, 2020 at 2:03 AM Steven Wu  wrote:

> OpenInx, thanks a lot for kicking off the discussion. Looks like my
> previous reply didn't reach the mailing list.
>
> > flink source based on the new FLIP-27 interface
>
> Yes, we shall target 0.11.0 release for the FLIP-27 flink source. I have
> updated the issue [1] with the following scopes.
>
>- Support both static/batch and continuous/streaming enumeration modes
>- Support only the simple assigner with no ordering/locality guarantee
>when handing out split assignment. But make the interface flexible to plug
>in different assigners (like the event time alignment assigner or locality
>aware assigner)
>- It will be in @Experimental status, as nobody has run FLIP-27 sources in
>production today. The Flink 1.12.0 release (ETA end of Nov) will have the
>first set of sources (Kafka and file) implemented with the FLIP-27 source
>framework. We still need to gain more production experience.
>
>
> [1] https://github.com/apache/iceberg/issues/1626
>
> On Wed, Oct 28, 2020 at 12:15 AM OpenInx  wrote:
>
>> Hi  dev
>>
>> As we know, we will be happy to cut the iceberg 0.10.0 candidate release
>> this week.  I think it may be the time to plan for the future iceberg
>> 0.11.0 now, so I created a Java 0.11.0 Release milestone here [1]
>>
>> I put the following issues into the newly created milestone:
>>
>> 1.   Apache Flink Rewrite Actions in Apache Iceberg.
>>
>> It's possible to run into too many small-file issues when running
>> the iceberg flink sink in real production because of the frequent
>> checkpoints.  We have two approaches to handle the small files:
>>
>> a.  Following the design of the current spark rewrite actions, flink will
>> provide similar rewrite actions running in a batch job.  It's
>> suitable for triggering whole-table or whole-partition compactions
>> periodically, because this kind of rewrite compacts many large files
>> and may consume lots of bandwidth.  Currently, JunZheng and I are working
>> on this issue, and we've extracted the base rewrite actions shared between
>> the spark module and the flink module.  The next step is implementing
>> rewrite actions in the flink module.
>>
>> b. Compact the small files inside the Flink streaming job while sinking
>> into the Iceberg table. That means we will provide a new rewrite operator
>> chained after the current IcebergFilesCommitter. Once an Iceberg
>> transaction has been committed, the newly introduced rewrite operator will
>> check whether a small compaction is needed. These actions pick only a few
>> tiny files (several KB or MB; we could provide a configurable threshold)
>> to rewrite, which keeps the cost minimal and the compaction efficient.
>> simonsssu from Tencent has provided a WIP PR here [2].
>>
>>
>> 2. Allow writing CDC or UPSERT records from Flink streaming jobs.
>>
>> We have almost finished the row-level delete feature in the Iceberg
>> master branch, but it still lacks integration with the compute engines
>> (to be precise, Spark/Flink can read the expected records if the rows
>> were deleted correctly, but the write path is not available yet). I am
>> preparing a patch for sinking CDC into Iceberg with a Flink streaming job
>> here [3]; I think it will be ready in the next few weeks.
>>
>> 3. Apache Flink streaming reader.
>>
>> We have prepared a POC version in our Alibaba internal branch, but have
>> not contributed it to Apache Iceberg yet. I think it is worth finishing
>> that in the following days.
>>
>>
>> The above are the issues that I think are worth merging before Iceberg
>> 0.11.0. But I am not quite sure about the plans for these items:
>>
>> 1. I know @Anton Okolnychyi  is working on Spark SQL extensions for
>> Iceberg; I guess there is a high probability of getting that in? [4]
>>
>> 2. @Steven Wu  from Netflix is working on a Flink source based on the
>> new FLIP-27 interface; thoughts? [5]
>>
>> 3. How about the Spark row-level delete integration work?
>>
>>
>>
>> [1].  https://github.com/apache/iceberg/milestone/12
>> [2]. https://github.com/apache/iceberg/pull/1669/files
>> [3]. https://github.com/apache/iceberg/pull/1663
>> [4]. https://github.com/apache/iceberg/milestone/11
>> [5]. https://github.com/apache/iceberg/issues/1626
>>
>


Plans for the future iceberg 0.11.0 release

2020-10-28 Thread OpenInx
Hi  dev

As you know, we plan to cut the Iceberg 0.10.0 release candidate this week,
so I think it is time to plan for the future Iceberg 0.11.0. I have created
a Java 0.11.0 release milestone here [1].

I put the following issues into the newly created milestone:

1. Apache Flink rewrite actions in Apache Iceberg.

We may run into too many small files when running the Iceberg Flink sink in
real production because of the frequent checkpoints. We have two approaches
to handle the small files:

a. Following the design of the current Spark rewrite actions, Flink will
provide similar rewrite actions that run as a batch job. This is suitable
for periodically compacting a whole table or whole partitions, because this
kind of rewrite compacts many large files and may consume a lot of
bandwidth. JunZheng and I are working on this issue, and we have extracted
the base rewrite actions shared between the Spark and Flink modules. The
next step is to implement the rewrite actions in the Flink module.

b. Compact the small files inside the Flink streaming job while sinking
into the Iceberg table. That means we will provide a new rewrite operator
chained after the current IcebergFilesCommitter. Once an Iceberg
transaction has been committed, the newly introduced rewrite operator will
check whether a small compaction is needed. These actions pick only a few
tiny files (several KB or MB; we could provide a configurable threshold) to
rewrite, which keeps the cost minimal and the compaction efficient (see the
selection sketch after this list). simonsssu from Tencent has provided a
WIP PR here [2].


2. Allow writing CDC or UPSERT records from Flink streaming jobs.

We have almost finished the row-level delete feature in the Iceberg master
branch, but it still lacks integration with the compute engines (to be
precise, Spark/Flink can read the expected records if the rows were deleted
correctly, but the write path is not available yet). I am preparing a patch
for sinking CDC into Iceberg with a Flink streaming job here [3]; I think
it will be ready in the next few weeks.

3. Apache Flink streaming reader.

We have prepared a POC version in our Alibaba internal branch, but have not
contributed it to Apache Iceberg yet. I think it is worth finishing that in
the following days.
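
To make approach b above concrete, here is a minimal sketch of the
threshold-based file selection it describes. The class name and the
threshold knob are illustrative assumptions on my side, not code from the
WIP PR; only DataFile.fileSizeInBytes() is the real Iceberg API.

import java.util.List;
import java.util.stream.Collectors;
import org.apache.iceberg.DataFile;

/** Sketch only: pick the tiny committed files worth a cheap rewrite. */
public class SmallFileSelector {
  // Hypothetical knob; the real PR may name and wire the threshold differently.
  private final long smallFileThresholdBytes;

  public SmallFileSelector(long smallFileThresholdBytes) {
    this.smallFileThresholdBytes = smallFileThresholdBytes;
  }

  /** Returns only the files below the threshold (a few KB or MB). */
  public List<DataFile> selectSmallFiles(List<DataFile> committedFiles) {
    return committedFiles.stream()
        .filter(file -> file.fileSizeInBytes() < smallFileThresholdBytes)
        .collect(Collectors.toList());
  }
}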


The above are the issues that I think are worth merging before Iceberg
0.11.0. But I am not quite sure about the plans for these items:

1. I know @Anton Okolnychyi  is working on Spark SQL extensions for
Iceberg; I guess there is a high probability of getting that in? [4]

2. @Steven Wu  from Netflix is working on a Flink source based on the new
FLIP-27 interface; thoughts? [5]

3. How about the Spark row-level delete integration work?



[1].  https://github.com/apache/iceberg/milestone/12
[2]. https://github.com/apache/iceberg/pull/1669/files
[3]. https://github.com/apache/iceberg/pull/1663
[4]. https://github.com/apache/iceberg/milestone/11
[5]. https://github.com/apache/iceberg/issues/1626


Re: Several flink pull requests need to get merged before the next release 0.10.0

2020-10-27 Thread OpenInx
Hi Ryan

Is it the right time once we get PR 1477 merged? Do we have any other
blockers for the coming 0.10.0 release?

Thanks.

On Wed, Oct 21, 2020 at 9:13 AM Ryan Blue  wrote:

> Hey, thanks for bringing these up. I'm planning on spending some time
> reviewing tomorrow and I can take a look at the first two.
>
> I just merged the first one since it was small, thanks for the fix! Feel
> free to ping me or other committers to review these. I do think it is
> important to have a committer review, even if the community also has
> positive reviews.
>
> rb
>
> On Mon, Oct 19, 2020 at 7:15 PM OpenInx  wrote:
>
>> Hi
>>
>> As we know, the next release, 0.10.0, is coming; there are several issues
>> that I think should be merged as soon as possible:
>>
>> 1. https://github.com/apache/iceberg/pull/1477
>>
>> It changes the Flink state design to write the completed data files into
>> a manifest before the checkpoint finishes, which minimizes the Flink
>> state size and improves state compatibility. (Before this change we
>> serialized DataFile objects into the Flink state backend, but the
>> DataFile class depends on several Java-serializable classes, which means
>> that if those dependency classes change, deserializing the state may
>> fail.) Currently I have a +1 from Steven Zhen Wu; thanks for his patient
>> reviewing. According to the Apache rules, I need another +1 from an
>> Iceberg committer. Does anyone have time to finish the review?
>>
>> 2. https://github.com/apache/iceberg/pull/1586
>>
>> This introduces options to load an external hive-site.xml for the Flink
>> Hive catalog, which is really helpful in production environments and is
>> not a hard change. It still needs a review from Iceberg committers. Thanks.
>>
>> 3. https://github.com/apache/iceberg/pull/1619
>>
>> We introduced a separate write parallelism for the Iceberg Flink stream
>> writers. Thanks to kbendick and stevenzwu for reviewing; the PR has two
>> +1s now. Should I merge it?
>>
>>
>> Besides the Flink PRs, it would be very helpful to raise any other issues
>> that are blocking the 0.10.0 release. I am happy to help resolve them.
>>
>> Thanks.
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Several flink pull requests need to get merged before the next release 0.10.0

2020-10-19 Thread OpenInx
Hi

As we know, the next release, 0.10.0, is coming; there are several issues
that I think should be merged as soon as possible:

1. https://github.com/apache/iceberg/pull/1477

It changes the Flink state design to write the completed data files into a
manifest before the checkpoint finishes, which minimizes the Flink state
size and improves state compatibility. (Before this change we serialized
DataFile objects into the Flink state backend, but the DataFile class
depends on several Java-serializable classes, which means that if those
dependency classes change, deserializing the state may fail.) A sketch of
the new state layout follows this list. Currently I have a +1 from Steven
Zhen Wu; thanks for his patient reviewing. According to the Apache rules, I
need another +1 from an Iceberg committer. Does anyone have time to finish
the review?

2. https://github.com/apache/iceberg/pull/1586

This introduces options to load an external hive-site.xml for the Flink
Hive catalog, which is really helpful in production environments and is not
a hard change. It still needs a review from Iceberg committers. Thanks.

3. https://github.com/apache/iceberg/pull/1619

We introduced a separate write parallelism for the Iceberg Flink stream
writers. Thanks to kbendick and stevenzwu for reviewing; the PR has two +1s
now. Should I merge it?
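
For reference, below is a rough sketch of the state layout that PR 1477
moves toward, as I understand it: checkpoint only the location of a
manifest that already lists the completed data files, instead of
Java-serialized DataFile objects. The class and method names here are
illustrative assumptions, not the actual PR code; only the Flink
CheckpointedFunction/ListState APIs are real.

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

public abstract class ManifestStateSketch implements CheckpointedFunction {
  private transient ListState<String> manifestLocations;

  @Override
  public void initializeState(FunctionInitializationContext context) throws Exception {
    manifestLocations = context.getOperatorStateStore().getListState(
        new ListStateDescriptor<>("iceberg-manifest-locations", String.class));
  }

  @Override
  public void snapshotState(FunctionSnapshotContext context) throws Exception {
    // Write the completed data files into a manifest first, then record only
    // its location; the checkpointed state no longer depends on DataFile's
    // Java-serialization compatibility.
    manifestLocations.add(writeManifestForPendingFiles());
  }

  /** Writes pending data files into a manifest and returns its location. */
  protected abstract String writeManifestForPendingFiles();
}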


Besides the Flink PRs, it would be very helpful to raise any other issues
that are blocking the 0.10.0 release. I am happy to help resolve them.

Thanks.


Re: Incremental reads for Upsert!

2020-10-19 Thread OpenInx
Yeah, we have discussed incremental readers several times; here is the
conclusion [1]. I also wrote a document showing the thoughts behind the
discussion [2], in case you are interested.

In my opinion, the next release, 0.10.0, will include the basic Flink sink
connector and batch reader, and I expect the following 0.11.0 to include
CDC ingestion and a streaming (append-only log, not CDC) reader for Flink.
CDC streaming readers may land in 0.12.0.

1. https://github.com/apache/iceberg/issues/360#issuecomment-653532308
2.
https://docs.google.com/document/d/1bBKDD4l-pQFXaMb4nOyVK-Sl3N2NTTG37uOCQx8rKVc/edit#heading=h.ljqc7bxmc6ej


On Tue, Oct 20, 2020 at 4:43 AM Ryan Blue  wrote:

> Hi Ashish,
>
> We've discussed this use case quite a bit, but I don't think that there
> are currently any readers that expose the deletes as a stream. Right now,
> all of the readers produce records from the current tables state. I think
> @OpenInx  and @Jingsong Li  have
> some plans to expose such a reader for Flink, though. Maybe they can work
> with you to on some milestones and a roadmap.
>
> rb
>
> On Fri, Oct 16, 2020 at 11:28 AM Ashish Mehta 
> wrote:
>
>> Hi,
>>
>> Is there a spec/proposal/milestone issue that talks about Incremental
>> reads for UPSERT? i.e. allowing clients to read a dataSet's APPEND/DELETES
>> with some options of exposing actually deleted rows.
>>
>> I am in view that exposing deleted rows might be trivial with positional
>> deletes, so an option might be actually helpful if the client would be
>> creating positional deletes in his use case.
>>
>> Thanks,
>> Ashish
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Iceberg V2 Spec

2020-09-20 Thread OpenInx
Thanks for the great work, Ryan; I would be glad to be a reviewer!

On Sat, Sep 19, 2020 at 7:37 AM Ryan Blue  wrote:

> I'm working on an update to the spec. We've completed the Java library
> implementation end-to-end, so now we have working code that will be
> released in 0.10.0. Next step is the spec update to document everything now
> that we're confident that it works as expected.
>
> Look for a PR in the next few days. It would be great to have more
> reviewers!
>
> rb
>
> On Mon, Sep 14, 2020 at 9:22 AM Chen Song  wrote:
>
>> I want to follow up on this. Is there an official consolidated design
>> doc/proposal (even wip) on V2 spec?
>>
>> I saw Streaming CDC in Iceberg
>> <https://docs.google.com/document/d/1bBKDD4l-pQFXaMb4nOyVK-Sl3N2NTTG37uOCQx8rKVc/edit#heading=h.2u29lq1ekp5r>
>>  in
>> a few update emails related, but it only covers one part.
>>
>> Chen
>>
>> On Thu, Jul 2, 2020 at 9:53 PM OpenInx  wrote:
>>
>>> Sounds good to me.
>>>
>>> Thanks.
>>>
>>> On Fri, Jul 3, 2020 at 12:58 AM Ryan Blue  wrote:
>>>
>>>> I'd like to get 0.9.0 out as soon as possible. I expect to get an early
>>>> RC out next week, once we have more tests committed. That way, people can
>>>> start trying it out and reporting back where it doesn't work.
>>>>
>>>> I'd rather not block 0.9.0 to wait on Flink connector components.
>>>> There's still a lot of work to get in, so I think it would be good to keep
>>>> these decoupled. That said, I think it would make sense to have a release
>>>> once the Flink connector is ready, just like we would do for Spark 3
>>>> support.
>>>>
>>>> Does that sound reasonable?
>>>>
>>>> On Wed, Jul 1, 2020 at 7:39 PM OpenInx  wrote:
>>>>
>>>>> Hi Ryan:
>>>>>
>>>>> Just curious when do we plan to release 0.9.0 ?  I expect that the
>>>>> flink connector could be included in release 0.9.0.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> On Thu, Jul 2, 2020 at 12:14 AM Ryan Blue 
>>>>> wrote:
>>>>>
>>>>>> Hi Chen,
>>>>>>
>>>>>> Right now, the main parts of the v2 spec are the addition of sequence
>>>>>> numbers and delete files. We're also making some other requirements more
>>>>>> strict, but those are mainly cleaning up problems and not related to
>>>>>> row-level deletes.
>>>>>>
>>>>>> Upserts would be encoded as a delete and an insert. Deletes are
>>>>>> stored in delete files, and inserts are normal data files. Delete files 
>>>>>> are
>>>>>> valid within a partition, and apply to all data files with the same or
>>>>>> lower sequence number.
>>>>>>
>>>>>> I'm planning on updating what's currently in the spec now that we
>>>>>> have sequence numbers and delete file metadata committed in master, but
>>>>>> right now I'm working on getting the 0.9.0 release out with support for
>>>>>> Spark 3. The documentation should be coming in the next couple of weeks.
>>>>>>
>>>>>> rb
>>>>>>
>>>>>> On Wed, Jul 1, 2020 at 6:28 AM Chen Song 
>>>>>> wrote:
>>>>>>
>>>>>>> I saw Table Spec V2
>>>>>>> <https://iceberg.apache.org/spec/#version-2-row-level-deletes> was
>>>>>>> mentioned in the official iceberg doc. I know it is incomplete and wip. 
>>>>>>> Is
>>>>>>> there any to-be-reviewed or proposed version for public view? I am
>>>>>>> interested to understand how row level upserts are supported?
>>>>>>>
>>>>>>> Thanks
>>>>>>> --
>>>>>>> Chen Song
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Netflix
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>>
>> --
>> Chen Song
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Timestamp Based Incremental Reading in Iceberg ...

2020-09-08 Thread OpenInx
I agree that it is helpful to allow users to read incremental deltas based
on timestamps; as Jingsong said, timestamps are more user-friendly.

My question is: how do we implement this?

If we just attach the client's timestamp to the Iceberg table when
committing, different clients may produce different timestamp values
because of clock skew. In theory, these time values are not strictly
comparable, and can only be compared within a margin of error.
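
To illustrate one possible mitigation (this is only a sketch of the idea,
not existing Iceberg behavior): the committer could clamp its commit
timestamp to be strictly greater than the parent snapshot's timestamp, so
the history stays linear even with skewed clocks.

// Sketch: derive a non-decreasing commit timestamp despite clock skew.
static long nextCommitTimestampMillis(long parentSnapshotTimestampMillis) {
  long now = System.currentTimeMillis();
  // If this client's clock is behind the previous committer's, fall back
  // to parent + 1 ms so snapshot timestamps remain strictly increasing.
  return Math.max(now, parentSnapshotTimestampMillis + 1);
}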


On Wed, Sep 9, 2020 at 10:06 AM Jingsong Li  wrote:

> +1 for timestamps are linear, in implementation, maybe the writer only
> needs to look at the previous snapshot timestamp.
>
> We're trying to think of iceberg as a message queue, Let's take the
> popular queue Kafka as an example,
> Iceberg has snapshotId and timestamp, corresponding, Kafka has offset and
> timestamp:
> - offset: It is used for incremental read, such as the state of a
> checkpoint in a computing system.
> - timestamp: It is explicitly specified by the user to specify the scope
> of consumption. As start_timestamp of reading. Timestamp is a better user
> aware interface. But offset/snapshotId is not human readable and friendly.
>
> So there are scenarios where timestamp is used for incremental read.
>
> Best,
> Jingsong
>
>
> On Wed, Sep 9, 2020 at 12:45 AM Sud  wrote:
>
>>
>> We are using incremental read for iceberg tables which gets quite few
>> appends ( ~500- 1000 per hour) . but instead of using timestamp we use
>> snapshot ids and track state of last read snapshot Id.
>> We are using timestamp as fallback when the state is incorrect, but as
>> you mentioned if timestamps are linear then it works as expected.
>> We also found that incremental reader might be slow when dealing with >
>> 2k snapshots in range. we are currently testing a manifest based
>> incremental reader which looks at manifest entries instead of scanning
>> snapshot history and accessing each snapshot.
>>
>> Is there any reason you can't use snapshot based incremental read?
>>
>> On Tue, Sep 8, 2020 at 9:06 AM Gautam  wrote:
>>
>>> Hello Devs,
>>>We are looking into adding workflows that read data
>>> incrementally based on commit time. The ability to read deltas between
>>> start / end commit timestamps on a table and ability to resume reading from
>>> last read end timestamp. In that regard, we need the timestamps to be
>>> linear in the current active snapshot history (newer versions always have
>>> higher timestamps). Although Iceberg commit flow ensures the versions are
>>> newer, there isn't a check to ensure timestamps are linear.
>>>
>>> Example flow, if two clients (clientA and clientB), whose time-clocks
>>> are slightly off (say by a couple seconds), are committing frequently,
>>> clientB might get to commit after clientA even if it's new snapshot
>>> timestamps is out of order. I might be wrong but I haven't found a check in
>>> HadoopTableOperations.commit() to ensure this above case does not happen.
>>>
>>> On the other hand, restricting commits due to out-of-order timestamps
>>> can hurt commit throughput so I can see why this isn't something Iceberg
>>> might want to enforce based on System.currentTimeMillis(). Although if
>>> clients had a way to define their own globally synchronized timestamps
>>> (using external service or some monotonically increasing UUID) then iceberg
>>> could allow an API to set that on the snapshot or use that instead of
>>> System.currentTimeMillis(). Iceberg exposes something similar using
>>> Sequence numbers in v2 format to track Deletes and Appends.
>>>
>>> Is this a concern others have? If so how are folks handling this today
>>> or are they not exposing such a feature at all due to the inherent
>>> distributed timing problem? Would like to hear how others are
>>> thinking/going about this. Thoughts?
>>>
>>> Cheers,
>>>
>>> -Gautam.
>>>
>>
>
> --
> Best, Jingsong Lee
>


Re: [DISCUSS] August board report

2020-08-13 Thread OpenInx
Thanks for the links, Jacques. I will try to create a pull request to add
those sharing links.

On Thu, Aug 13, 2020 at 10:24 AM Jacques Nadeau  wrote:

> The conference was free so all the recordings are available on-demand for
> free:
> https://subsurfaceconf.com/summer2020/recordings
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
>
> On Wed, Aug 12, 2020 at 7:07 PM OpenInx  wrote:
>
>> > Community members gave 2 Iceberg talks at Subsurface Conf, on enabling
>> Hive
>> queries against Iceberg tables and working with petabyte-scale Iceberg
>> tables.
>> Iceberg was also mentioned in the keynotes.
>>
>> Are there slides or videos of the two Iceberg talks? I would like to
>> read/watch them, but I could not find the resources after a bit of
>> googling. How about creating a page to collect all those talks (and also
>> a 'powered by' page)?
>>
>>
>>
>> On Thu, Aug 13, 2020 at 7:50 AM Owen O'Malley 
>> wrote:
>>
>>> +1 looks good.
>>>
>>> On Wed, Aug 12, 2020 at 4:41 PM Ryan Blue  wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> Here's a draft of the board report for this month. Please reply with
>>>> anything that you'd like to see added or that I've missed. Thanks!
>>>>
>>>> rb
>>>>
>>>> ## Description:
>>>> Apache Iceberg is a table format for huge analytic datasets that is
>>>> designed
>>>> for high performance and ease of use.
>>>>
>>>> ## Issues:
>>>> There are no issues requiring board attention.
>>>>
>>>> ## Membership Data:
>>>> Apache Iceberg was founded 2020-05-19 (2 months ago)
>>>> There are currently 10 committers and 9 PMC members in this project.
>>>> The Committer-to-PMC ratio is roughly 1:1.
>>>>
>>>> Community changes, past quarter:
>>>> - No new PMC members (project graduated recently).
>>>> - Shardul Mahadik was added as committer on 2020-07-25
>>>>
>>>> ## Project Activity:
>>>> 0.9.0 was released, including support for Spark 3 and SQL DDL commands,
>>>> support
>>>> for JDK 11, vectorized Parquet reads, and an action to compact data
>>>> files.
>>>>
>>>> Since the 0.9.0 release, the community has made progress in several
>>>> areas:
>>>> - The Hive StorageHandler now provides access to query Iceberg tables
>>>>   (work is ongoing to implement projection and predicate pushdown).
>>>> - Flink integration has made substantial progress toward using native
>>>> RowData,
>>>>   and the first stage of the Flink sink (data file writers) has been
>>>> committed.
>>>> - An action to expire snapshots using Spark was added and is an
>>>> improvement on
>>>>   the incremental approach because it compares the reachable file sets.
>>>> - The implementation of row-level deletes is nearing completion. Scan
>>>> planning
>>>>   now supports delete files, merge-based and set-based row filters have
>>>> been
>>>>   committed, and delete file writers are under review. The delete file
>>>> writers
>>>>   allow storing deleted row data in support of Flink CDC use cases.
>>>>
>>>> Releases:
>>>> - 0.9.0 was released on 2020-07-13
>>>> - 0.9.1 has an ongoing vote
>>>>
>>>> ## Community Health:
>>>> The month since the last report has been one of the busiest since the
>>>> project
>>>> started. 80 pull requests were merged in the last 4 weeks, and more
>>>> importantly,
>>>> came from 21 different contributors. Both of these are new high
>>>> watermarks.
>>>>
>>>> Community members gave 2 Iceberg talks at Subsurface Conf, on enabling
>>>> Hive
>>>> queries against Iceberg tables and working with petabyte-scale Iceberg
>>>> tables.
>>>> Iceberg was also mentioned in the keynotes.
>>>>
>>>> --
>>>> Ryan Blue
>>>>
>>>


Re: [DISCUSS] August board report

2020-08-12 Thread OpenInx
> Community members gave 2 Iceberg talks at Subsurface Conf, on enabling
Hive
queries against Iceberg tables and working with petabyte-scale Iceberg
tables.
Iceberg was also mentioned in the keynotes.

Are there slides or videos of the two Iceberg talks? I would like to
read/watch them, but I could not find the resources after a bit of
googling. How about creating a page to collect all those talks (and also a
'powered by' page)?



On Thu, Aug 13, 2020 at 7:50 AM Owen O'Malley 
wrote:

> +1 looks good.
>
> On Wed, Aug 12, 2020 at 4:41 PM Ryan Blue  wrote:
>
>> Hi everyone,
>>
>> Here's a draft of the board report for this month. Please reply with
>> anything that you'd like to see added or that I've missed. Thanks!
>>
>> rb
>>
>> ## Description:
>> Apache Iceberg is a table format for huge analytic datasets that is
>> designed
>> for high performance and ease of use.
>>
>> ## Issues:
>> There are no issues requiring board attention.
>>
>> ## Membership Data:
>> Apache Iceberg was founded 2020-05-19 (2 months ago)
>> There are currently 10 committers and 9 PMC members in this project.
>> The Committer-to-PMC ratio is roughly 1:1.
>>
>> Community changes, past quarter:
>> - No new PMC members (project graduated recently).
>> - Shardul Mahadik was added as committer on 2020-07-25
>>
>> ## Project Activity:
>> 0.9.0 was released, including support for Spark 3 and SQL DDL commands,
>> support
>> for JDK 11, vectorized Parquet reads, and an action to compact data files.
>>
>> Since the 0.9.0 release, the community has made progress in several areas:
>> - The Hive StorageHandler now provides access to query Iceberg tables
>>   (work is ongoing to implement projection and predicate pushdown).
>> - Flink integration has made substantial progress toward using native
>> RowData,
>>   and the first stage of the Flink sink (data file writers) has been
>> committed.
>> - An action to expire snapshots using Spark was added and is an
>> improvement on
>>   the incremental approach because it compares the reachable file sets.
>> - The implementation of row-level deletes is nearing completion. Scan
>> planning
>>   now supports delete files, merge-based and set-based row filters have
>> been
>>   committed, and delete file writers are under review. The delete file
>> writers
>>   allow storing deleted row data in support of Flink CDC use cases.
>>
>> Releases:
>> - 0.9.0 was released on 2020-07-13
>> - 0.9.1 has an ongoing vote
>>
>> ## Community Health:
>> The month since the last report has been one of the busiest since the
>> project
>> started. 80 pull requests were merged in the last 4 weeks, and more
>> importantly,
>> came from 21 different contributors. Both of these are new high
>> watermarks.
>>
>> Community members gave 2 Iceberg talks at Subsurface Conf, on enabling
>> Hive
>> queries against Iceberg tables and working with petabyte-scale Iceberg
>> tables.
>> Iceberg was also mentioned in the keynotes.
>>
>> --
>> Ryan Blue
>>
>


Re: [DISCUSS] 0.9.1 release

2020-08-03 Thread OpenInx
> Does anyone know if we can recover existing data affected by it?

In PR #1271, two data types have correctness bugs: decimal18 and
timestampZone.

For decimal18, we actually write the correct decimal value, but read it
back in an incorrect way. Say the type is decimal(10,3) and the value is
10.100: the ORC writer stores it in the file as 101*10^(-1), while before
this patch we read it as 101*10^(-3). If we construct the BigDecimal with
the stored exponent (-1) and then adjust it to scale 3, in theory we can
still recover the correct decimal, 10100*10^(-3).
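
A tiny illustration of the recovery idea for the example above, using plain
java.math (this is not Iceberg code):

import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalRecoveryExample {
  public static void main(String[] args) {
    BigInteger unscaled = BigInteger.valueOf(101);

    // Buggy read before the patch: apply the declared scale 3 directly.
    BigDecimal buggy = new BigDecimal(unscaled, 3);                  // 0.101

    // Recovery idea: apply the exponent the writer actually used
    // (10^-1, i.e. BigDecimal scale 1), then adjust to the declared scale 3.
    BigDecimal recovered = new BigDecimal(unscaled, 1).setScale(3);  // 10.100

    System.out.println(buggy + " vs " + recovered);
  }
}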

For timestampZone, I would say that we have stored the wrong value in the
file; the error between the written timestamp and the correct timestamp
should be less than one second. The reason is here [1]: for negative
values, -5 / 2 = -2 while floorDiv(-5, 2) = -3, so the error is at most one
unit, and the nanosecond part of a timestamp is always less than one
second. However, I have not found a way to recover the existing data.

1.
https://github.com/apache/iceberg/pull/1271/files#diff-5aa4840155ec70fdf7f725e122cde7b7L218
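
For anyone who wants to see the division difference mentioned above, this
plain-Java snippet reproduces it:

public class FloorDivExample {
  public static void main(String[] args) {
    // Java integer division truncates toward zero; floorDiv rounds toward
    // negative infinity, so the two differ by one for negative values.
    System.out.println(-5 / 2);               // -2
    System.out.println(Math.floorDiv(-5, 2)); // -3
  }
}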



On Tue, Aug 4, 2020 at 3:08 AM Ryan Blue  wrote:

> Yes, we should get #1269 into a patch release as well since it is a
> correctness bug.
>
> Does anyone know if we can recover existing data affected by it?
>
> On Mon, Aug 3, 2020 at 11:08 AM Anton Okolnychyi 
> wrote:
>
>> I see a few open issues for ORC. Some of them seem critical (like issue
>> #1269). Do we want to fix those before the release? Or is ORC support still
>> experimental?
>>
>> - Anton
>>
>> On 1 Aug 2020, at 20:04, Jungtaek Lim 
>> wrote:
>>
>> Sure! I just submitted #1285
>>  to exclude the refactor.
>> Once #1285 is merged I'll rebase the existing PR to do the refactor. Thanks
>> for the input!
>>
>> On Sun, Aug 2, 2020 at 4:41 AM Ryan Blue 
>> wrote:
>>
>>> Thanks, Jungtaek! I agree it would be great to fix that problem. I took
>>> a quick look at the PR and it is a little big to go into a patch release
>>> since it refactors quite a few places to consolidate the list copy. What do
>>> you think about making a PR that just fixes the problem with
>>> BaseCombinedScanTask and Kryo, then doing the remainder of the refactor in
>>> master?
>>>
>>> On Fri, Jul 31, 2020 at 5:29 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 If we still have some more days I think #1280
 : "fix serialization
 issue in BaseCombinedScanTask with Kyro" is a good candidate to be
 included. The bug affects both Spark and Flink (according to #1279
 ).

 On Sat, Aug 1, 2020 at 8:04 AM Ryan Blue  wrote:

> Hi everyone,
>
> We’ve accumulated a few bug fixes in the last couple of weeks and I
> think it might make sense to get some of them out in an 0.9.1 release 
> since
> they make it harder to work with Iceberg. Here are the ones I know about:
>
>- #1282 : rewriteNot
>fails for binary and unary predicates
>- #1278 : Bad import
>from commons-compress causes query failures
>- #1251 : Fixes more
>imports from non-Iceberg Guava
>- #1283 : Query
>descriptions fail when IN predicates are pushed
>- #1228 : Data
>imports fail when paths include whitespace
>- #1194 : USING
>should set format when used in a CTAS command
>- #1203 : Table cache
>should not expire
>
> If there are no objections, I’ll get started and create a release
> branch. And please reply if there are other issues you’ve seen that should
> also be included in a patch release.
>
> rb
> --
> Ryan Blue
>

>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Iceberg community sync notes - 29 July 2020

2020-08-02 Thread OpenInx
Sorry that I missed the community online sync.

> Flink integration

Let me provide more details about the Flink integration progress. The work
we have done:
1. The data type conversion between Flink and Iceberg.
2. We currently have Parquet/Avro Row readers and writers, but since we
upgraded Flink to 1.11, which changed its row data type to RowData, we
(Junjie -> Parquet, Jingsong -> Avro, Zheng Hu -> ORC) are refactoring them
to support Parquet/Avro/ORC RowData readers and writers.
3. We have abstracted the partitioned and unpartitioned task writers shared
between Flink and Spark, so that different compute engines can reuse them
in the write path.
4. Jingsong has committed a Flink catalog implementation, similar to the
Spark catalog.

The next things we will work on:
1. The Flink sink connector will be composed of two operators. The first is
IcebergStreamWriter, which collects records and emits DataFiles to the
downstream committer; the second is IcebergFilesCommitter, which
periodically commits the data files to Iceberg in a transaction. Finishing
these two operators is the first thing we need to do (see the sketch after
this list).
2. Once we have finished the Flink DataStream Iceberg sink, we will create
PRs to make the Flink Table SQL integration work.
3. Flink streaming reader / batch reader, etc.
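
Here is a rough sketch of how the two-operator topology in item 1 could be
wired with Flink's DataStream API. The operator arguments, type
information, and parallelism values are placeholders for illustration;
only DataStream#transform and the parallelism setters are real Flink APIs.

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.operators.OneInputStreamOperator;

public class SinkTopologySketch {
  /** Chains writer -> committer; all operator arguments are placeholders. */
  public static <R, F> void build(
      DataStream<R> input,
      OneInputStreamOperator<R, F> streamWriter,
      TypeInformation<F> dataFileTypeInfo,
      OneInputStreamOperator<F, Void> filesCommitter,
      TypeInformation<Void> voidTypeInfo,
      int writeParallelism) {
    input
        .transform("IcebergStreamWriter", dataFileTypeInfo, streamWriter)
        .setParallelism(writeParallelism)  // many parallel writers
        .transform("IcebergFilesCommitter", voidTypeInfo, filesCommitter)
        .setParallelism(1)                 // a single committer serializes
        .setMaxParallelism(1);             // the transactional commits
  }
}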

> Kyle: I’ll be interested to review.

Thanks for taking the time to review those PRs.

> It seems like points raised by @openinx in the CDC pipelines doc must be
resolved before moving on with any implementation.

The key problems described in this document are:
1. Fast ingestion;
2. Acceptable batch read performance;
3. An equivalent stream, so that we can keep eventual consistency between
the source table and the sink table.

In this issue [1], Ryan and I have reached a rough agreement on the design:
we will propose a mix of equality deletes and position deletes to
accomplish those three goals. We may need to spend some time splitting up
these tasks.
1. https://github.com/apache/iceberg/issues/360

Thanks.

On Sat, Aug 1, 2020 at 8:17 AM Anton Okolnychyi
 wrote:

> Hey everyone,
>
> Here are my notes from the last sync. Feel free to add/correct.
>
> *Conferences*
>
> There are three talks on Iceberg at the Dremio conference.
> - "The Future of Intelligent Storage in Big Data" by Dan- "Hiveberg:
> Integrating Apache Iceberg with the Hive Metastore" by Adrian and Christine
> - "Lessons learned from running Apache Iceberg at PB scale" by Anton
>
> *Hive integration*
>
> Adrien: Found a bug when MR job is launched in distributed mode, @guilload
> and @rdsr are taking a look at it and will propose a fix soon.
> Adrien: It is hard to work with large tables as predicate push-down is not
> working. Waiting for a PR from @cmathiesen and @massdosage.
>
> *Flink integration*
>
> Junjie: There is some progress on the Flink sync and the work is split
> into smaller PRs that are getting merged into master.
> Kyle: I’ll be interested to review.
>
>
> *Row-level deletes*
> Anton: Most of the work for core metadata is done. We have delete
> manifests, sequence numbers, updated manifest lists.
> Junjie: There is progress on readers to project metadata columns like row
> position in Avro, Parquet, ORC.
> Anton: I was supposed to start working on two-phase job planning approach
> but was distracted by other things. Plan to resume looking into that.
> Anton: It seems like points raised by @openinx in the CDC pipelines
> doc must be resolved before moving on with any implementation.
>
> Could not get more details as neither Ryan nor Zheng was present.
>
> CDC open questions:
> https://docs.google.com/document/d/1bBKDD4l-pQFXaMb4nOyVK-Sl3N2NTTG37uOCQx8rKVc
>
>
> *SQL extensions*
> Anton: Thanks everyone for the feedback. Looks like we almost have
> consensus on how that should look like. There is one open question raised
> by Carl.
> Carl: How will the currently proposed approach that relies on stored
> procedures work with role-based access control? Presto has supports for
> this.
> Anton: We can limit the access to stored procedures but I don’t know how
> we can limit calling a stored procedure on a particular table if the table
> name as passed as an argument.
> Carl: It feels easier with ALTER TABLE syntax.
> Carl: It is better to follow up with the Presto community on this.
> Anton: Agreed. It is a blocker to move forward.
>
> Dev list discussion:
> https://lists.apache.org/thread.html/rb3321727198d65246ec9eb0f938b121ec6fe5dd0face0b2fb86a%40%3Cdev.iceberg.apache.org%3E
> <https://lists.apache.org/thread.html/rb3321727198d65246ec9eb0f938b121ec6fe5dd0face0b2fb86a@%3Cdev.iceberg.apache.org%3E>
>
> SQL extensions doc:
> https://docs.google.com/document/d/1Nf8c16R2hj4lSc-4sQg4oiUUV_F4XqZKt

Re: New committer: Shardul Mahadik

2020-07-22 Thread OpenInx
Congratulations !

On Thu, Jul 23, 2020 at 9:31 AM Jingsong Li  wrote:

> Congratulations Shardul! Well deserved!
>
> Best,
> Jingsong
>
> On Thu, Jul 23, 2020 at 7:27 AM Anton Okolnychyi
>  wrote:
>
>> Congrats and welcome! Keep up the good work!
>>
>> - Anton
>>
>> On 22 Jul 2020, at 16:02, RD  wrote:
>>
>> Congratulations Shardul! Well deserved!
>>
>> -Best,
>> R.
>>
>> On Wed, Jul 22, 2020 at 2:24 PM Ryan Blue  wrote:
>>
>>> Hi everyone,
>>>
>>> I'd like to congratulate Shardul Mahadik, who was just invited to join
>>> the Iceberg committers!
>>>
>>> Thanks for all your contributions, Shardul!
>>>
>>> rb
>>>
>>>
>>> --
>>> Ryan Blue
>>>
>>
>>
>
> --
> Best, Jingsong Lee
>


Re: [VOTE] Release Apache Iceberg 0.9.0 RC5

2020-07-09 Thread OpenInx
I followed the verify guide here (
https://lists.apache.org/thread.html/rd5e6b1656ac80252a9a7d473b36b6227da91d07d86d4ba4bee10df66%40%3Cdev.iceberg.apache.org%3E)
:

1. Verify the signature: OK
2. Verify the checksum: OK
3. Untar the archive tarball: OK
4. Run RAT checks to validate license headers: RAT checks passed
5. Build and test the project: all unit tests passed.

+1 (non-binding).

On Fri, Jul 10, 2020 at 9:46 AM Ryan Blue  wrote:

> Hi everyone,
>
> I propose the following RC to be released as the official Apache Iceberg
> 0.9.0 release.
>
> The commit id is 4e66b4c10603e762129bc398146e02d21689e6dd
> * This corresponds to the tag: apache-iceberg-0.9.0-rc5
> * https://github.com/apache/iceberg/commits/apache-iceberg-0.9.0-rc5
> * https://github.com/apache/iceberg/tree/4e66b4c1
>
> The release tarball, signature, and checksums are here:
> * https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.9.0-rc5/
>
> You can find the KEYS file here:
> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>
> Convenience binary artifacts are staged in Nexus. The Maven repository URL
> is:
> *
> https://repository.apache.org/content/repositories/orgapacheiceberg-1008/
>
> This release includes support for Spark 3 and vectorized reads for flat
> schemas in Spark.
>
> Please download, verify, and test.
>
> Please vote in the next 72 hours.
>
> [ ] +1 Release this as Apache Iceberg 0.9.0
> [ ] +0
> [ ] -1 Do not release this because...
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Iceberg V2 Spec

2020-07-02 Thread OpenInx
Sounds good to me.

Thanks.

On Fri, Jul 3, 2020 at 12:58 AM Ryan Blue  wrote:

> I'd like to get 0.9.0 out as soon as possible. I expect to get an early RC
> out next week, once we have more tests committed. That way, people can
> start trying it out and reporting back where it doesn't work.
>
> I'd rather not block 0.9.0 to wait on Flink connector components. There's
> still a lot of work to get in, so I think it would be good to keep these
> decoupled. That said, I think it would make sense to have a release once
> the Flink connector is ready, just like we would do for Spark 3 support.
>
> Does that sound reasonable?
>
> On Wed, Jul 1, 2020 at 7:39 PM OpenInx  wrote:
>
>> Hi Ryan:
>>
>> Just curious when do we plan to release 0.9.0 ?  I expect that the flink
>> connector could be included in release 0.9.0.
>>
>> Thanks.
>>
>> On Thu, Jul 2, 2020 at 12:14 AM Ryan Blue 
>> wrote:
>>
>>> Hi Chen,
>>>
>>> Right now, the main parts of the v2 spec are the addition of sequence
>>> numbers and delete files. We're also making some other requirements more
>>> strict, but those are mainly cleaning up problems and not related to
>>> row-level deletes.
>>>
>>> Upserts would be encoded as a delete and an insert. Deletes are stored
>>> in delete files, and inserts are normal data files. Delete files are valid
>>> within a partition, and apply to all data files with the same or lower
>>> sequence number.
>>>
>>> I'm planning on updating what's currently in the spec now that we have
>>> sequence numbers and delete file metadata committed in master, but right
>>> now I'm working on getting the 0.9.0 release out with support for Spark 3.
>>> The documentation should be coming in the next couple of weeks.
>>>
>>> rb
>>>
>>> On Wed, Jul 1, 2020 at 6:28 AM Chen Song  wrote:
>>>
>>>> I saw Table Spec V2
>>>> <https://iceberg.apache.org/spec/#version-2-row-level-deletes> was
>>>> mentioned in the official iceberg doc. I know it is incomplete and wip. Is
>>>> there any to-be-reviewed or proposed version for public view? I am
>>>> interested to understand how row level upserts are supported?
>>>>
>>>> Thanks
>>>> --
>>>> Chen Song
>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Iceberg V2 Spec

2020-07-01 Thread OpenInx
Hi Ryan:

Just curious: when do we plan to release 0.9.0? I hope the Flink connector
can be included in the 0.9.0 release.

Thanks.

On Thu, Jul 2, 2020 at 12:14 AM Ryan Blue  wrote:

> Hi Chen,
>
> Right now, the main parts of the v2 spec are the addition of sequence
> numbers and delete files. We're also making some other requirements more
> strict, but those are mainly cleaning up problems and not related to
> row-level deletes.
>
> Upserts would be encoded as a delete and an insert. Deletes are stored in
> delete files, and inserts are normal data files. Delete files are valid
> within a partition, and apply to all data files with the same or lower
> sequence number.
>
> I'm planning on updating what's currently in the spec now that we have
> sequence numbers and delete file metadata committed in master, but right
> now I'm working on getting the 0.9.0 release out with support for Spark 3.
> The documentation should be coming in the next couple of weeks.
>
> rb
>
> On Wed, Jul 1, 2020 at 6:28 AM Chen Song  wrote:
>
>> I saw Table Spec V2
>>  was
>> mentioned in the official iceberg doc. I know it is incomplete and wip. Is
>> there any to-be-reviewed or proposed version for public view? I am
>> interested to understand how row level upserts are supported?
>>
>> Thanks
>> --
>> Chen Song
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


[Doc] Streaming CDC in Iceberg

2020-06-28 Thread OpenInx
Hi dev:

We have a discussion about equality deletes here [1]. Things become more
complex when considering streaming CDC events into an Iceberg table, so I
prepared a document for further discussion here [2].

Any suggestions and feedback are welcome, thanks.

[1]. https://github.com/apache/iceberg/issues/360
[2].
https://docs.google.com/document/d/1bBKDD4l-pQFXaMb4nOyVK-Sl3N2NTTG37uOCQx8rKVc/edit#


Re: [DISCUSS] Changes for row-level deletes

2020-05-05 Thread OpenInx
Besides that, I would like to share some work from my Flink team; I hope it
will be helpful for you.

We have customers who want to try Flink + Iceberg to build their business
data lake. The classic scenarios are: 1. streaming click events into
Iceberg for analysis by other OLAP engines; 2. streaming CDC events into
Iceberg to reflect fresh data.
We have built a public repo [1] to start our development and PoC. We are
developing in a separate repo because we want to do some technical
verification as soon as possible, not because we are trying to split the
Apache Iceberg community. On the contrary, we hope that some of our
exploration work will have the opportunity to be merged into the Apache
Iceberg repository :-)

We have finished a fully functional Flink sink connector [2] so far:
1. It supports exactly-once streaming semantics. We designed the sink
connector's state carefully so that it can fail over correctly. Compared to
the Netflix Flink connector, we can run multiple sink operators instead of
a single operator, which improves throughput a lot.
2. It supports almost all Iceberg data types. We provide the conversion
between Iceberg and Flink tables so that different engines
(Flink/Spark/Hive/Presto) can share the same table (the Netflix version is
bound to an Avro format, which seems not a good choice).
3. It supports both partitioned and unpartitioned tables now, similar to
the Iceberg Spark writer.
4. It provides complete unit tests and end-to-end tests to verify the
feature.
5. It supports the Flink Table API so that we can write to the Iceberg
table with Flink SQL, such as INSERT INTO test SELECT ...

The next features we will work on:
1. Provide a Flink streaming reader to consume the incremental events from
upstream, so that we can build the data pipeline:
(flink)->(iceberg)->(flink)->(iceberg) ...
2. Implement the upsert feature for primary-key cases: each row in the
Iceberg table has a PK, and CDC upserts the table by PK. The current design
seems to meet the customers' requirements, so we plan to do the PoC (a
conceptual sketch follows the links below).

[1]. https://github.com/generic-datalake/iceberg
[2]. https://github.com/generic-datalake/iceberg/tree/master/flink/src
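
To make item 2 concrete, here is a conceptual sketch of how a PK upsert
maps onto the row-level delete design: an equality delete on the key
followed by an insert of the new row. All interfaces here are hypothetical
placeholders, not our connector's real classes.

import java.util.function.Function;

/** Conceptual sketch: upsert = equality delete on the key + insert. */
class UpsertSketch<R, K> {
  interface EqualityDeleteWriter<K> { void delete(K key); }
  interface DataWriter<R> { void insert(R row); }

  private final EqualityDeleteWriter<K> deletes;
  private final DataWriter<R> inserts;
  private final Function<R, K> keyExtractor;

  UpsertSketch(EqualityDeleteWriter<K> deletes, DataWriter<R> inserts,
      Function<R, K> keyExtractor) {
    this.deletes = deletes;
    this.inserts = inserts;
    this.keyExtractor = keyExtractor;
  }

  void upsert(R row) {
    deletes.delete(keyExtractor.apply(row)); // drop any older row with this PK
    inserts.insert(row);                     // then write the new version
  }
}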

On Wed, May 6, 2020 at 11:44 AM OpenInx  wrote:

> The two-phase approach sounds good to me. The precondition is that we have
> a limited number of delete files so that memory can hold all of them; since
> we will have a compaction service to reduce the delete files, that seems
> not to be a problem.
>


Re: [DISCUSS] Changes for row-level deletes

2020-05-05 Thread OpenInx
The two-phase approach sounds good to me. The precondition is that we have
a limited number of delete files so that memory can hold all of them; since
we will have a compaction service to reduce the delete files, that seems
not to be a problem. (A sketch of the in-memory apply step is below.)
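
As an illustration of what "memory can hold all of them" means in practice,
here is a minimal sketch of the set-based apply step, with placeholder
types that are not the actual Iceberg classes:

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Sketch: load all equality-delete keys, then filter rows against them. */
class EqualityDeleteFilter<R, K> {
  private final Set<K> deletedKeys;
  private final Function<R, K> keyExtractor;

  EqualityDeleteFilter(List<K> keysFromDeleteFiles, Function<R, K> keyExtractor) {
    // Precondition from above: the delete files are small enough that all
    // their keys fit into this in-memory set.
    this.deletedKeys = new HashSet<>(keysFromDeleteFiles);
    this.keyExtractor = keyExtractor;
  }

  boolean isLive(R row) {
    return !deletedKeys.contains(keyExtractor.apply(row));
  }
}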


Re: Iceberg community sync notes - 15 April 2020

2020-04-16 Thread OpenInx
Thanks for the write-up.
The views feature from the Netflix branch is great. Is there any plan to
port it to Apache Iceberg?

On Fri, Apr 17, 2020 at 5:31 AM Ryan Blue  wrote:

> Here are my notes from yesterday’s sync. As usual, feel free to add to
> this if I missed something.
>
> There were a couple of questions raised during the sync that we’d like to
> open up to anyone who wasn’t able to attend:
>
>- Should we wait for the parallel metadata rewrite action before
>cutting 0.8.0 candidates?
>- Should we wait for ORC metrics before cutting 0.8.0 candidates?
>
> In the sync, we thought that it would be good to wait and get these in.
> Please reply to this if you agree or disagree.
>
> Thanks!
>
> *Attendees*:
>
>- Ryan Blue
>- Dan Weeks
>- Anjali Norwood
>- Jun Ma
>- Ratandeep Ratti
>- Pavan
>- Christine Mathiesen
>- Gautam Kowshik
>- Mass Dosage
>- Filip
>- Ryan Murray
>
> *Topics*:
>
>- 0.8.0 release blockers: actions, ORC metrics
>- Row-level delete update
>- Parquet vectorized read update
>- InputFormats and Hive support
>- Netflix branch
>
> *Discussion*:
>
>- 0.8.0 release
>   - Ryan: we planned to get a candidate out this week, but I think we
>   may want to wait on 2 things that are about ready
>   - First: Anton is contributing an action to rewrite manifests in
>   parallel that is close. Anyone interested? (Gautam is interested)
>   - Second: ORC is passing correctness tests, but doesn’t have
>   column-level metrics. Should we wait for this?
>   - Ratandeep: ORC also lacks predicate push-down support
>   - Ryan: I think metrics are more important than PPD because PPD is
>   task side and metrics help reduce the number of tasks. If we wait on 
> one,
>   I’d prefer to wait on metrics
>   - Ratandeep will look into whether he or Shardul can work on this
>   - General consensus was to wait for these features before getting a
>   candidate out
>- Row-level deletes
>   - Good progress in several PRs on adding the parallel v2 write
>   path, as Owen suggested last sync
>   - Junjie contributed an update to the spec for file/position delete
>   files
>- Parquet vectorized read
>   - Dan: flat schema reads are primarily waiting on reviews
>   - Dan: is anyone interested in complex type support?
>   - Gautam needs struct and map support. 0.14.0 doesn’t support maps
>   - Ryan (Murray): 0.17.0 will have lists, structs, and maps, but not
>   maps of structs
>   - Ryan (Blue): Because we have a translation layer in Iceberg to
>   pass off to Spark, we don’t actually need support in Arrow. We are
>   currently stuck on 0.14.0 because of changes that prevent us from 
> avoiding
>   a null check (see this comment
>   
>   )
>-
>
>InputFormat and Hive support
>- Ratandeep: Generic (mapreduce) InputFormat is in with hooks for Pig
>   and Hive; need to start working on the serde side and building a Hive
>   StorageHandler, missing DDL support
>   - Ryan: What DDL support?
>   -
>
>   Ratandeep: Statements like ADD PARTITION
>   -
>
>   Ryan: How would all of this work in Hive? It isn’t clear what
>   components are needed right now: StorageHandler? RawStore? HiveMetaHook?
>   - Ratandeep: Currently working on only the read path, which
>   requires a StorageHandler. The write path would be more difficult.
>   - Mass Dosage: Working on a (mapred) InputFormat for Hive in
>   iceberg-mr, started working on a serde in iceberg-hive. Interested in
>   writes, but not in the short or medium term
>   - Mass Dosage: The main problem is dependency conflicts between
>   Hive and Iceberg, mainly Guava
>   - Ryan: Anyone know a good replacement for Guava collections?
>   - Ryan: In Avro, we have a module that shades Guava
>   
> 
>   and has a class with references
>   
> .
>   Then shading can minimize the shaded classes. We could do that here.
>   - Ryan: Is Jackson also a problem?
>   - Mass Dosage: Yes, and calcite
>   - Ryan: Calcite probably isn’t referenced directly so we can
>   hopefully avoid the consistent versions problem by excluding it
>- Netflix branch of Iceberg (with non-Iceberg additions)
>   - Ryan: We’ve published a copy of our current Iceberg 0.7.0-based
>   branch 
>   for Spark 2.4 with DSv2 backported
>   
>   - The purpose of this is to share non-Iceberg work that we use to
>   compliment Iceberg, like views, catalogs, 

Re: Open a new branch for row-delete feature ?

2020-04-02 Thread OpenInx
OK. We are going to push this forward in two ways: first, providing the PRs
that do not depend on the table specification; second, providing a proposal
to address the table specification changes.

Thanks for the feedback.

On Thu, Apr 2, 2020 at 3:22 AM Ryan Blue  wrote:

> I agree with Miao’s points. I think what I missed in an email above was
> the assertion that we can move faster in a branch than merging PRs into
> master. We should be breaking the work down into small reviewable pieces,
> as you suggested, but merge them to master like Miao pointed out. That has
> the benefits of more iterative development and doesn’t have the costs of a
> branch.
>
> Iterative development is what we’ve always tried to encourage for this
> feature, which is why we broke the design down into next steps in the
> milestone. It’s good to see progress on those, like Junjie’s PR to add
> file/position delete files to the spec. There is still quite a bit to do
> that is unblocked:
>
>- Write up a spec for equality delete files
>- Implement a reader and writer for file/position deletes
>- Implement a reader and writer for equality deletes
>- Implement a row filter that applies a file/position delete file
>(using a merge strategy)
>- Implement a row filter that applies equality deletes using a hash set
>- Implement a row filter that applies equality deletes using a merge
>strategy
>- Make progress on adding sort orders to the spec (know when to apply
>the equality deletes with a merge)
>
> For the two PRs you referenced:
>
>1. Adding sequence numbers is difficult because it requires an option
>to move to the v2 format. This is going to take some careful planning that
>I haven’t seen a proposal for yet, and haven’t had a chance to do myself
>yet. I suggest focusing on other areas that are not blocked at the moment.
>2. As I mentioned in the sync, how we will track delete files in
>metadata is still outstanding. Maybe we will choose to add a type column
>like this, but I’d like to have a design in mind before we merge these PRs.
>Thinking through this and coming up with a proposal here is the next
>priority for this work, because it will unlock more tasks we can do in
>parallel.
>
>
> On Tue, Mar 31, 2020 at 7:51 PM OpenInx  wrote:
>
>> Hi Ryan & Miao
>>
>> I am fine with it if you think it is not the appropriate time to open a
>> new feature branch for the row-delete feature. The point for us is to
>> finish the row-delete feature in Apache Iceberg as soon as possible, so
>> that our community users can benefit from it and adopt Iceberg in more
>> scenarios.
>> As we know, the table specification changes are the foundation, so what
>> do you think about the newly introduced table specification changes in
>> the PRs below? If there are any concerns on your side, please let us
>> know, so that we can iterate on the patches and push the work forward.
>> 1. sequence_number: https://github.com/apache/incubator-iceberg/pull/588
>> (the delete differential file is also considered a data file).
>> 2. file_type: https://github.com/apache/incubator-iceberg/pull/885 (this
>> field distinguishes whether a file is a normal data file, a file/pos
>> delete file, or an equality-delete file).
>>
>> Thanks.
>>
>>
>> On Wed, Apr 1, 2020 at 2:44 AM Miao Wang  wrote:
>>
>>> +1 on not creating a branch now. Rebasing and maintenance are too
>>> expensive, comparing of fast development. Some additional thoughts below.
>>>
>>>
>>>
>>> Row-delete feature should be behind a feature flag, which implies that
>>> it should have minimum impact on Master branch if it is turned off. Working
>>> on Master avoids the pain of breaking Master at branch merge, which
>>> actually works at a fail-fast and fail-early mode.
>>>
>>>
>>>
>>> Working on Master Branch will not prevent splitting the feature into
>>> small items. Instead, it will encourage more people to work it and help the
>>> community stay focus on Master roadmap.
>>>
>>>
>>>
>>> Finally, if we think about rebasing, it either ends too expensive to
>>> rebase or easy to rebase. If it is the former case, we should not create a
>>> branch because it is hard to keep sync with Master. If it is the latter
>>> case, it has little impact on Master and there is no need to have a branch.
>>>
>>>
>>>
>>> Thanks!
>>>
>>>
>>>

Re: Open a new branch for row-delete feature ?

2020-03-31 Thread OpenInx
Hi Ryan & Miao

I am fine with it if you think it is not the appropriate time to open a new
feature branch for the row-delete feature. The point for us is to finish
the row-delete feature in Apache Iceberg as soon as possible, so that our
community users can benefit from it and adopt Iceberg in more scenarios.
As we know, the table specification changes are the foundation, so what do
you think about the newly introduced table specification changes in the PRs
below? If there are any concerns on your side, please let us know, so that
we can iterate on the patches and push the work forward.
1. sequence_number: https://github.com/apache/incubator-iceberg/pull/588
(the delete differential file is also considered a data file).
2. file_type: https://github.com/apache/incubator-iceberg/pull/885 (this
field distinguishes whether a file is a normal data file, a file/pos delete
file, or an equality-delete file).
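
As an illustration of what the file_type field distinguishes, think of
something like the following enum; the names here are mine, not the PR's,
and the actual encoding is up to the spec discussion.

/** Illustrative only; the actual names/encoding are up to the spec PR. */
enum FileType {
  DATA,             // a normal data file
  POSITION_DELETES, // a file/pos delete file
  EQUALITY_DELETES  // an equality-delete file
}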

Thanks.


On Wed, Apr 1, 2020 at 2:44 AM Miao Wang  wrote:

> +1 on not creating a branch now. Rebasing and maintenance are too
> expensive, comparing of fast development. Some additional thoughts below.
>
>
>
> Row-delete feature should be behind a feature flag, which implies that it
> should have minimum impact on Master branch if it is turned off. Working on
> Master avoids the pain of breaking Master at branch merge, which actually
> works at a fail-fast and fail-early mode.
>
>
>
> Working on Master Branch will not prevent splitting the feature into small
> items. Instead, it will encourage more people to work it and help the
> community stay focus on Master roadmap.
>
>
>
> Finally, if we think about rebasing, it either ends too expensive to
> rebase or easy to rebase. If it is the former case, we should not create a
> branch because it is hard to keep sync with Master. If it is the latter
> case, it has little impact on Master and there is no need to have a branch.
>
>
>
> Thanks!
>
>
>
> Miao
>
>
>
> *From: *Ryan Blue 
> *Reply-To: *"dev@iceberg.apache.org" , "
> rb...@netflix.com" 
> *Date: *Tuesday, March 31, 2020 at 10:08 AM
> *To: *OpenInx 
> *Cc: *Iceberg Dev List 
> *Subject: *Re: Open a new branch for row-delete feature ?
>
>
>
> I'm fine starting a branch later if we do run into those issues, but I
> don't think it is a good idea to do it now in anticipation. All of the work
> that we can do on master we should try to do on master. We can start a
> branch when we need one.
>
>
>
> On Mon, Mar 30, 2020 at 7:44 PM OpenInx  wrote:
>
> Hi Ryan
>
>
>
> The reason I suggest to open a new dev branch for row-delete development
> is:  we will split the whole feature into
>
> many small issues and each issue will have a pull request with appropriate
> length of code so the contributors/reviewers
>
> can discuss one point each time and make this feature a faster iteration.
> In the process of implementation, we will ensure
>
> that the v1 works for every separate PR but it may not ready for cutting
> release, for example, when release the 0.8.0 I'm
>
> sure we won't like the release version contains part of the v2 spec(such
> as provide the sequence_number, but no file_type).
>
> The spark reader/writer and data/delete manifest may also need some code
> refactor, it's possible to put them into several PR.
>
> Splitting into multiple Pull Requests may block the release of the new
> version for a certain period of time, that's not we want
>
> to see.
>
>
> About the new branch maintenance, in my experience we could rebase the new
> branch with master periodically (such as rebase
>
> for every three days), so that the new pull request for row-delete will be
> designed based on the newest changes. It should work
>
> for the master which would not have too many new change. This is in line
> with our current situation.
>
>
>
> In this case, I weighed the maintenance costs of the new branch against
> the delay of the row-delete. I think we should let the
>
> row-delete go a little faster (almost all community users are looking
> forward to this feature), and I think the current maintenance
>
> cost is acceptable.
>
>
>
> Thanks
>
>
>
> On Tue, Mar 31, 2020 at 5:52 AM Ryan Blue 
> wrote:
>
> Sorry, I didn't address the suggestion to add a Flink branch as well. The
> work needed for the Flink sink is to remove parts that are specific to
> Netflix, so I'm not sure what the rationale for a branch would be. Is there
> a reason why this can't be done in master, but requires a shared branch? If
> multiple people want to contribute, why not contribute to the same PR?
>
>
>
> A shared 

Re: Open a new branch for row-delete feature ?

2020-03-30 Thread OpenInx
Hi Ryan

The reason I suggest opening a new dev branch for row-delete development
is: we will split the whole feature into many small issues, and each issue
will have a pull request of reasonable size so that contributors/reviewers
can discuss one point at a time and iterate on this feature faster. In the
process of implementation, we will ensure that v1 works for every separate
PR, but it may not be ready for cutting a release. For example, when
releasing 0.8.0, I'm sure we won't want the release to contain only part
of the v2 spec (such as providing the sequence_number, but no file_type).
The Spark reader/writer and data/delete manifests may also need some code
refactoring; it's possible to split them into several PRs. Splitting into
multiple pull requests may block the release of the new version for a
certain period of time, and that's not what we want to see.

About the new branch maintenance: in my experience we could rebase the new
branch on master periodically (such as every three days), so that new pull
requests for row-delete will be designed based on the newest changes. This
should work as long as master does not have too many new changes, which is
in line with our current situation.
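
For example, the periodic sync could be as simple as this (assuming we name
the feature branch row-delete):

git checkout row-delete
git rebase master                              # replay branch commits onto the latest master
git push --force-with-lease origin row-delete  # safely update the shared branch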

In this case, I weighed the maintenance cost of the new branch against the
delay of the row-delete feature. I think we should let row-delete go a
little faster (almost all community users are looking forward to this
feature), and I think the current maintenance cost is acceptable.

Thanks

On Tue, Mar 31, 2020 at 5:52 AM Ryan Blue  wrote:

> Sorry, I didn't address the suggestion to add a Flink branch as well. The
> work needed for the Flink sink is to remove parts that are specific to
> Netflix, so I'm not sure what the rationale for a branch would be. Is there
> a reason why this can't be done in master, but requires a shared branch? If
> multiple people want to contribute, why not contribute to the same PR?
>
> A shared PR branch makes the most sense to me for this because it is
> regularly tested against master.
>
> On Mon, Mar 30, 2020 at 2:48 PM Ryan Blue  wrote:
>
>> I think we may eventually want a branch, but I think it is too early
>> to create one now.
>>
>> Branches are expensive. They require maintenance to stay in sync with
>> master, usually by copying changes from master into the branch with updates.
>> Adapting the changes from master for the branch is more difficult because it
>> is usually not the original contributor or reviewer porting them. And it is
>> better to catch problems between changes in master and the branch early.
>>
>> I'm not against branches, but I don't want to create them unless they are
>> valuable. In this case, I don't see the value. We plan to add v2 in
>> parallel so you can still write v1 tables for compatibility, and most of
>> the work that needs to be done -- like creating readers and writers for
>> diff formats -- can be done in master.
>>
>> rb
>>
>> On Mon, Mar 30, 2020 at 9:00 AM Gautam  wrote:
>>
>>> Thanks for bringing this up OpenInx.  That's a great idea: to open a
>>> separate branch for row-level deletes.
>>>
>>> I would like to help support/contribute/review this as well. If there
>>> are sub-tasks you guys have identified that can be added to
>>> https://github.com/apache/incubator-iceberg/milestone/4 we can start
>>> taking those up too.
>>>
>>> thanks for the good work,
>>> - Gautam.
>>>
>>>
>>>
>>> On Mon, Mar 30, 2020 at 8:39 AM Junjie Chen 
>>> wrote:
>>>
>>>> +1 to create the branch. Some row-level delete subtasks must be based
>>>> on the sequence number, as well as end-to-end tests.
>>>>
>>>> On Fri, Mar 27, 2020 at 4:42 PM OpenInx  wrote:
>>>>
>>>>> Dear Dev:
>>>>>
>>>>>  On Tuesday, we had a sync meeting and discussed the following things:
>>>>>  1.  cut the 0.8.0 release;
>>>>>  2.  flink connector ;
>>>>>  3.  iceberg row-level delete;
>>>>>  4. Map-Reduce Formats and Hive support.
>>>>>
>>>>>   We'll release version 0.8.0 around April 15, and the following 0.9.0
>>>>> will be released in the next few months. On the other hand, Ryan,
>>>>> Junjie Chen and I have done three PoC versions for the row-level
>>>>> deletes. We had a full discussion [4] and started on the relevant code
>>>>> design. We're sure that the feature will introduce some incompatible
>>>>> specification changes, suc

Re: Iceberg community sync - 2020-03-25

2020-03-28 Thread OpenInx
> Ryan has concerns about blogs in docs - why not link to blogs on other
platforms? We don’t want content to get stale or have the community
“reviewing” content.
I mean we could create a page to collect all the design doc links first.
The stale content is indeed a problem unless we update the doc for each
relevant change. I don't have a strong opinion about the reviewing
concern :-)

> Ryan: we’ll need reviewers because I’m not qualified. Will reach out to
Steven Wu (Netflix sink author) and other people interested in Flink.
Steven did a great job; he's the perfect reviewer if he has the bandwidth.
There are some Flink committers and PMC members on our Flink team; we could
also ping them.

> Openinx brought up concerns about minimizing end-to-end latency
Agreed that we could implement the file/pos deletes and equality-deletes
first. The offline optimization seems reasonable; we've also had an
internal discussion about the e2e latency and have some ideas to minimize
it, so maybe I could provide a simple doc to describe them. Anyway, we can
push the file/pos and equality deletes forward first.

On Sat, Mar 28, 2020 at 8:54 AM Ryan Blue  wrote:

> Hi everyone,
>
> Here are my notes from the discussion. These are based mainly on my
> memory, so feel free to correct or expand if you think it can be improved.
> Thanks!
>
> *Agenda*
>
>- Cadence for syncs - every 2-4 weeks?
>- 0.8.0 Java release
>- Community building
>- Flink source and sink status
>- MR formats and Hive support status
>- Security (authorization, data values in metadata)
>- Row-level deletes (main discussion)
>
> *Discussion*:
>
>- Sync cadence
>   - Ryan: with syncs alternating time zones, 4 weeks is too long, but
>   2 weeks is a lot for those of us attending all of them. How about 3 weeks?
>   - Consensus was every 3 weeks
>- 0.8.0 Java release
>   - When should we target for the release? Consensus was for
>   Mid-April (3 weeks)
>   - What do we want in the release? Main outstanding features are ORC
>   support, Parquet vectorized reads, Spark/Hive changes
>   - Ideally will include ORC support, since it is close
>   - Hive version is 2.3 and should not block Hive work
>   - Vectorized reads are nice-to-have but should not block a release
>   - Can we disable consistent versions for Spark 2.4 and Spark 3.0
>   support in the same repo? Ryan will dig up build script with baseline
>   applied to only some modules, maybe we can disable it
>- Community building
>   - Saisai suggested a Powered By page where we can post who is using
>   Iceberg in production. Great idea!
>   - Openinx suggested a blog section of the docs site
>   - Ryan has concerns about blogs in docs - why not link to blogs on
>   other platforms? We don’t want content to get stale or have the community
>   “reviewing” content.
>   - Owen: some blogs break links
>- Flink source and sinks status
>   - Tencent data lake team posted a sink based on Netflix skunkworks,
>   but needs to remove Netflix-specific features/dependencies
>   - Issues opened for work to get sink in
>   - Ryan: we’ll need reviewers because I’m not qualified. Will reach
>   out to Steven Wu (Netflix sink author) and other people interested in Flink.
>   - Ryan: the Spark source is coming along, but the hardest part is
>   getting a stream of files to process from table state. Is that something
>   we want to share between Spark and Flink implementations?
>   - Probably want to share, if possible
>- Skipped MR/Hive status and security (will start dev list thread) to
>get to row-level deletes
>- Row-level deletes roadmap:
>   - Ryan will be working on this more, with a doc for Spark MERGE
>   INTO interfaces coming soon
>   - This has been moving slowly because some parts, like sequence
>   numbers, require forward-breaking/v2 changes
>   - Owen suggested building two parallel write paths to be able to
>   write v1. Everyone agreed with this
>   - There are several projects that can be done by anyone and do not
>   require forward-breaking/v2 changes: delete file format readers, writers,
>   record iterator implementations to merge deletes (set-based, merge-based;
>   see the sketch after these notes), and specs for these once they are built
>   - Junjie offered to work on file/position delete files
>   - Equality delete merges are blocked on sort order addition to the
>   format
>   - Main blocking decision point is how to track delete files in
>   manifests, Ryan will start a dev list thread
>   - Openinx brought up concerns about minimizing end-to-end latency
&
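
For anyone curious, a minimal sketch of the set-based merge mentioned above
(the class name and Row accessors are assumptions, not an agreed API):

import java.util.HashSet;
import java.util.Set;

public class SetBasedDeleteFilter {

  /** Minimal row abstraction for this sketch; the real reader row type differs. */
  public interface Row {
    String filePath();  // data file the row belongs to
    long rowOffset();   // position of the row within that file
  }

  private final Set<String> deleted = new HashSet<>();

  public SetBasedDeleteFilter(Iterable<Row> deleteRecords) {
    // Key each delete by "path#pos"; a real implementation would use a
    // proper pair type rather than string concatenation.
    for (Row d : deleteRecords) {
      deleted.add(d.filePath() + "#" + d.rowOffset());
    }
  }

  /** Returns true if the data row survives all position deletes. */
  public boolean isLive(Row dataRow) {
    return !deleted.contains(dataRow.filePath() + "#" + dataRow.rowOffset());
  }
}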

Open a new branch for row-delete feature ?

2020-03-27 Thread OpenInx
Dear Dev:

 On Tuesday, we had a sync meeting and discussed the following things:
 1.  cut the 0.8.0 release;
 2.  flink connector ;
 3.  iceberg row-level delete;
 4. Map-Reduce Formats and Hive support.

  We'll release version 0.8.0 around April 15, and the following 0.9.0 will
 be released in the next few months. On the other hand, Ryan, Junjie Chen
 and I have done three PoC versions for the row-level deletes. We had
 a full discussion [4] and started on the relevant code design. We're sure
 that the feature will introduce some incompatible specification changes,
 such as the sequence_number spec [1] and file_type spec [2]; the
 sortedOrder feature also seems to be a breaking change [3].

 To avoid affecting the release of version 0.8.0 and to push the row-delete
 feature forward early, I suggest opening a new branch for the row-delete
 feature, named branch-1. Once the row-delete feature is stable, we could
 release 1.0.0. Or we can just open a row-delete feature branch and, once
 the work is done, merge it back to the master branch and continue to
 release the 0.9.0 version.

 I guess the Flink connector devs are facing the same problem?

 What do you think about this ?

 Thank you.


  [1]. https://github.com/apache/incubator-iceberg/pull/588
  [2]. https://github.com/apache/incubator-iceberg/issues/824
  [3]. https://github.com/apache/incubator-iceberg/issues/317
  [4].
https://docs.google.com/document/d/1CPFun2uG-eXdJggqKcPsTdNa2wPMpAdw8loeP-0fm_M/edit?usp=sharing


Re: What have I learned from doing Merge-On-Read PoC

2020-03-23 Thread OpenInx
Thanks for sharing the PoC work from your team, Junjie.

I read your PoC PRs and issues. You considered the whole path, including
the Spark write behaviors (while I only considered the Iceberg write),
which helps us understand all the update/delete work.

There're some points we might need to discuss :-)

1.  the spark would delete the rows by the following API:

IcebergSource icebergTable = new IcebergSource();
icebergTable.deleteWhere(dbTable, new Filter[]{new EqualTo("data", "a1")});

Would the filter condition be limited to only supporting several simple
built-in filters, such as =, !=, >, <, <=, >=, IN, NOT, etc.? I saw Delta
also defines the same behavior [1][2]. In my mind, our users would like to
run unrestricted updates/deletes, which means they can run the following SQL:

update employee set work_year = work_year + 1 where company_id in  (select
id from company) and birthday >='1990-01-01';

I've considered the implementation: we may need to translate the UPDATE
plan into a SELECT plan so that we can read all the file_id & offset pairs,
and finally dump them into delete differential files. We've discussed this
problem before and noted that Spark doesn't provide a physical plan for
UPDATE, so there may be some obstacles. On the Flink side we may try to
accomplish the full UPDATE ... WHERE. I don't know how people from the
community think about the different designs, so I raise the issue here.
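
To make the translation idea concrete, a rough sketch of what I mean (the
_file/_pos metadata column names and the output path are assumptions, just
for illustration):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class UpdateWhereSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("update-where-poc").getOrCreate();

    // Step 1: run the UPDATE's WHERE clause as a SELECT that projects the
    // (hypothetical) metadata columns identifying each matching row.
    Dataset<Row> matches = spark.sql(
        "SELECT _file, _pos FROM employee "
            + "WHERE company_id IN (SELECT id FROM company) "
            + "AND birthday >= '1990-01-01'");

    // Step 2: dump the (file, offset) tuples into a delete differential
    // file; the updated rows themselves would be appended as new data files.
    matches.write().parquet("/tmp/employee-deletes");  // placeholder location
  }
}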

2.  The Spark writer would write the InternalRowWithMeta into data files;
the row is a tuple [3]. We might not need to write the file_path, because
all the rows in a data file share the same file_path; it is always a long
string and would cost lots of resources to compare and sort (I chose the
1-1 solution). row_offset could also be designed implicitly, as I said in
the document; we may need a PoC demo to prove this (I also defined an
explicit row_offset in the table in the PoC, :-) ).
btw, we might also sort the delete differential files by (file_path,
row_offset) so that we can do a merge sort for a faster JOIN [4].
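
To illustrate the sorting idea, a tiny sketch of the ordering I have in
mind (the DeleteRecord shape is an assumption, not an agreed schema):

import java.util.Comparator;

public class DeleteOrdering {
  /** Minimal stand-in for a delete record; the real schema is still open. */
  public static class DeleteRecord {
    final String filePath;
    final long rowOffset;

    DeleteRecord(String filePath, long rowOffset) {
      this.filePath = filePath;
      this.rowOffset = rowOffset;
    }
  }

  // Sorting deletes by (file_path, row_offset) lets a reader stream data rows
  // and delete records together in one merge-sort pass, instead of a hash
  // join that repeatedly compares long path strings.
  public static final Comparator<DeleteRecord> BY_FILE_THEN_OFFSET =
      Comparator.comparing((DeleteRecord d) -> d.filePath)
          .thenComparingLong(d -> d.rowOffset);
}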

3. Yeah, the sequence number should be added to the data file, manifest,
and snapshot. As discussed in the PR, compatibility is an issue to consider.

[1].
https://github.com/delta-io/delta/blob/master/src/main/scala/io/delta/tables/DeltaTable.scala#L265
[2]. https://docs.databricks.com/delta/delta-update.html
[3].
https://github.com/apache/incubator-iceberg/compare/master...chenjunjiedada:row-level-delete#diff-c168df8c9739650eab655b22b0b549acR407
[4].
https://github.com/apache/incubator-iceberg/compare/master...chenjunjiedada:row-level-delete#diff-fffa37e29d3736de086cbd23094865b7R63



On Sun, Mar 22, 2020 at 8:49 PM Junjie Chen 
wrote:

> Great job and nice document @OpenInx! Thanks for sharing the progress!
>
> I also did the PoC a couple of weeks ago; you can take a look at the code
> here
> <https://github.com/chenjunjiedada/incubator-iceberg/tree/row-level-delete>.
> My approach is to use the additional meta columns (SRI)  and it is based on
> the sequence number pull request #588
> <https://github.com/apache/incubator-iceberg/pull/588>.  The main
> differences from yours include:
>
>    - base file write path: It hooks the internal row to add metadata for
>    the file name and row id.
>    - delete file write path: It uses Spark to generate the deletion files
>    via a staging table, and also sorts the deletion files by file name.
>    - read path: Besides the sequence number, it uses the lower bound and
>    upper bound to narrow down the deletion files.
>    - base file + deletion file merge: It uses the filter API and also needs
>    the merge-sort optimization.
>
> FYI, there is also an issue
> <https://github.com/apache/incubator-iceberg/issues/825> about the
> additional meta column; it seems like Spark will handle the additional
> columns for Iceberg, so I didn't go further on that.
>
> Besides the design doc, we still need to finalize more details for
> merge-on-read, and I think that would be a good topic for the next
> sync-up meeting.
>
>
>
>
>
> On Sat, Mar 21, 2020 at 9:01 PM OpenInx  wrote:
>
>> Dear Iceberg Dev:
>>
>> As I said in the document [1] before, we think the Iceberg update/delete
>> features (mainly merge-on-read) are a high-priority feature (we've also
>> discussed some Flink+Iceberg scenarios, and anybody who is interested in
>> that part can read the document).
>>
>> Recently, I wrote a demo to implement the merge-on-read (PoC). The pull
>> request is here [2], and I also provided a document describing the work [3].
>>
>> Any suggestion or feedback would be appreciated, Thanks.
>>
>> [1].
>> https://docs.google.com/document/d/1I7FUPHyyvtZZ7zaTT1Lq14rNIEZFhzD41-fazVHEoIA/edit?usp=sharing
>> [2]. https://github.com/openinx/incubator-iceberg/pull/5/files
>> [3].
>> https://docs.google.com/document/d/1CPFun2uG-eXdJggqKcPsTdNa2wPMpAdw8loeP-0fm_M/edit?usp=sharing
>>
>>
>
> --
> Best Regards
>

