Re: [DISCUSS] Spark 3.1 support?

2023-04-24 Thread Edgar Rodriguez
Hi all,

Thanks for the discussion. Like Manu, we're on Spark 3.1.1 and
Iceberg 1.1.0 - we backport Spark 3.1.1 fixes internally as well. It's a
bit harder for us to move quickly across Spark versions internally, mainly
due to the number of Scala customers that we have.

I understand maintaining yet another Spark version is burdensome, so I'm +1
on marking 3.1 as deprecated, and I'd be happy to contribute backports to a
community-maintained branch if needed; we'd just need to tag changes that
may require a backport.

Cheers,

On Sun, Apr 23, 2023 at 4:40 PM Ryan Blue  wrote:

> Thank you for stepping up and offering to help, Manu. I'm glad that you're
> willing to help with backports.
>
> On Sun, Apr 23, 2023 at 2:05 AM Manu Zhang 
> wrote:
>
>>   You would just end up backporting twice.
>>
>>
>> That's why I said a community-maintained branch benefits us, saving one
>> backport. Note the first backport is more difficult, sometimes requiring
>> rewriting the PR, since there can be API differences between Spark
>> versions.
>> The second backport will be much easier if we focus on bug fixes.
>> Meanwhile, it's also easier for us to upgrade to Iceberg 1.2+ if 3.1
>> support is still available, although deprecated.
>>
>>
>
> --
> Ryan Blue
> Tabular
>


-- 
Edgar R


Re: [DISCUSS] Spark 3.1 support?

2023-04-21 Thread Edgar Rodriguez
Airbnb is also still on Spark 3.1 and I echo some of Walaa's comments.

Cheers,

On Thu, Apr 20, 2023 at 8:14 PM Walaa Eldin Moustafa 
wrote:

> LinkedIn is still on Spark 3.1. I am guessing a number of other companies
> could be in the same boat. I feel the argument for Spark 2.4 is different
> from that of Spark 3.1 and it would be great if we can continue to support
> 3.1 for some time.
>
> On Wed, Apr 19, 2023 at 11:06 AM Ryan Blue  wrote:
>
>> +1
>>
>> As we said in the 2.4 discussion, the format itself should provide
>> forward compatibility with tables and it is more clear that we aren't
>> adding new features if you have to use older versions for Spark 3.1.
>>
>> On Wed, Apr 19, 2023 at 10:08 AM Anton Okolnychyi
>>  wrote:
>>
>>> Hey folks,
>>>
>>> What does everybody think about Spark 3.1 support after we add Spark 3.4
>>> support? Our initial plan was to release jars for the last 3 versions. Are
>>> there any blockers for dropping 3.1?
>>>
>>> - Anton
>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

-- 
Edgar R


Re: [DISCUSS] Dropping Spark 2.4 support

2023-04-18 Thread Edgar Rodriguez
I'm generally +1 on dropping Spark 2.4 - almost everyone is moving to Spark
3.x, if they haven't already.

As for the Hadoop upgrade, I think that could be problematic for us if
there's any non-backwards-compatible API change required at compile time,
since we're still running a Hadoop 2.8.x version.

Cheers,

On Mon, Apr 17, 2023 at 3:50 PM Steve Zhang 
wrote:

> +1 for dropping Spark 2.4 support and we can clean up doc as well such as
> https://iceberg.apache.org/docs/latest/spark-queries/#spark-24
>
> Thanks,
> Steve Zhang
>
>
>
> On Apr 13, 2023, at 12:53 PM, Jack Ye  wrote:
>
> +1 for dropping 2.4 support
>
>
>

-- 
Edgar R
Data Warehouse Infrastructure


Re: [VOTE] Release Apache Iceberg 0.11.1 RC0

2021-03-30 Thread Edgar Rodriguez
+1 (non-binding)

- Verified build, signature and checksum.
- Ran internal integration tests.

Cheers,

On Tue, Mar 30, 2021 at 7:50 AM Ryan Murray  wrote:

> +1 (non-binding)
>
> verified build, tests, signature, checksum.
>
> Best,
> Ryan
>
> On Tue, Mar 30, 2021 at 4:40 AM Jack Ye  wrote:
>
>> +1 (non-binding)
>>
>> Verified build, unit test, AWS integration test, signature, checksum.
>> Verified fix of #2146, #2267, #2333 in AWS EMR Spark3 environment.
>>
>> Best,
>> Jack Ye
>>
>> On Mon, Mar 29, 2021 at 5:58 PM Anton Okolnychyi
>>  wrote:
>>
>>> +1 (binding)
>>>
>>> Checked the signature and checksum, ran RAT checks and tests.
>>>
>>> Had to trigger tests twice due to an HMS-related failure in DELETE tests
>>> in Spark extensions. We noticed that problem while testing the 0.11.0 RCs,
>>> and it was fixed in a later commit, which is not part of 0.11.1.
>>>
>>>
>>> https://github.com/apache/iceberg/commit/19622dcfcb426485748fa017a6181e23df5732dc
>>>
>>> - Anton
>>>
>>> On 29 Mar 2021, at 17:42, Anton Okolnychyi <
>>> aokolnyc...@apple.com.INVALID> wrote:
>>>
>>> Here is the link to steps we normally use to validate a release
>>> candidate:
>>>
>>> https://lists.apache.org/thread.html/rd5e6b1656ac80252a9a7d473b36b6227da91d07d86d4ba4bee10df66%40%3Cdev.iceberg.apache.org%3E
>>> 
>>>
>>> - Anton
>>>
>>> On 29 Mar 2021, at 17:41, Anton Okolnychyi <
>>> aokolnyc...@apple.com.INVALID> wrote:
>>>
>>> Hi everyone,
>>>
>>> I propose the following RC to be released as official Apache Iceberg
>>> 0.11.1 release.
>>>
>>> The commit id is 29cf712a821aa937e176f2d79a5593c4a1429e7f
>>> * This corresponds to the tag: apache-iceberg-0.11.1-rc0
>>> * https://github.com/apache/iceberg/commits/apache-iceberg-0.11.1-rc0
>>> *
>>> https://github.com/apache/iceberg/tree/29cf712a821aa937e176f2d79a5593c4a1429e7f
>>>
>>> The release tarball, signature, and checksums are here:
>>> *
>>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.11.1-rc0/
>>>
>>> You can find the KEYS file here (make sure to import the new key that
>>> was used to sign the release):
>>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>>
>>> Convenience binary artifacts are staged in Nexus. The Maven repository
>>> URL is:
>>> *
>>> https://repository.apache.org/content/repositories/orgapacheiceberg-1016/
>>>
>>> This patch release includes these fixes:
>>> https://github.com/apache/iceberg/milestone/13?closed=1
>>>
>>> Please download, verify, and test.
>>>
>>> Please vote in the next 72 hours.
>>>
>>> [ ] +1 Release this as Apache Iceberg 0.11.1
>>> [ ] +0
>>> [ ] -1 Do not release this because…
>>>
>>> Thanks,
>>> Anton
>>>
>>>
>>>
>>>

-- 
Edgar R


Re: Welcoming Ryan Murray as a new committer!

2021-03-29 Thread Edgar Rodriguez
Congratulations, Ryan!

Best,

On Mon, Mar 29, 2021 at 10:39 PM Jack Ye  wrote:

> Congratulations Ryan!
>
> On Mon, Mar 29, 2021 at 7:25 PM OpenInx  wrote:
>
>> Congrats, Ryan !  Well-deserved !
>>
>> On Tue, Mar 30, 2021 at 9:32 AM Junjie Chen 
>> wrote:
>>
>>> Congratulations. Ryan!
>>>
>>> On Tue, Mar 30, 2021 at 5:02 AM Daniel Weeks 
>>> wrote:
>>>
 Congrats, Ryan and thanks for all the great work!

 On Mon, Mar 29, 2021 at 1:59 PM Ryan Blue 
 wrote:

> Congratulations, Ryan!
>
> On Mon, Mar 29, 2021 at 1:49 PM Thirumalesh Reddy <
> thirumal...@dremio.com> wrote:
>
>> Congratulations Ryan
>>
>> Thirumalesh Reddy
>> Dremio | VP of Engineering
>>
>>
>> On Mon, Mar 29, 2021 at 9:16 AM Xinli shang 
>> wrote:
>>
>>> Congratulations Ryan!
>>>
>>> On Mon, Mar 29, 2021 at 9:13 AM Russell Spitzer <
>>> russell.spit...@gmail.com> wrote:
>>>
 :D We will always be Iceberg-committer twins now

 > On Mar 29, 2021, at 11:10 AM, Szehon Ho
  wrote:
 >
 > That’s awesome, great work Ryan.
 >
 > Szehon
 >
 >> On 29 Mar 2021, at 18:08, Anton Okolnychyi
  wrote:
 >>
 >> Hey folks,
 >>
 >> I’d like to welcome Ryan Murray as a new committer to the
 project!
 >>
 >> Thanks for all the hard work, Ryan!
 >>
 >> - Anton
 >


>>>
>>> --
>>> Xinli Shang
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>

>>>
>>> --
>>> Best Regards
>>>
>>

-- 
Edgar R


Re: Welcoming Russell Spitzer as a new committer

2021-03-29 Thread Edgar Rodriguez
Congrats, Russell!

Cheers,

On Mon, Mar 29, 2021 at 11:01 PM Robin Stephen 
wrote:

> Congratulations, Russell!
>
> Jack Ye wrote on Tue, Mar 30, 2021 at 10:39 AM:
>
>> Congratulations Russell!
>>
>> On Mon, Mar 29, 2021 at 7:25 PM OpenInx  wrote:
>>
>>> Congrats, Russell !  Well-deserved !
>>>
>>> On Tue, Mar 30, 2021 at 9:33 AM Junjie Chen 
>>> wrote:
>>>
   Congratulations, Russell! Nice work!

 On Tue, Mar 30, 2021 at 5:02 AM Daniel Weeks 
 wrote:

> Congrats, Russell!
>
> On Mon, Mar 29, 2021 at 1:59 PM Ryan Blue 
> wrote:
>
>> Congratulations, Russell!
>>
>> -- Forwarded message -
>> From: Gautam Kowshik 
>> Date: Mon, Mar 29, 2021 at 12:16 PM
>> Subject: Re: Welcoming Russell Spitzer as a new committer
>> To: 
>>
>>
>> Congrats Russell!
>>
>> Sent from my iPhone
>>
>> On Mar 29, 2021, at 9:41 AM, Dilip Biswal  wrote:
>>
>> 
>> Congratulations Russell !! Very well deserved, indeed !!
>>
>> On Mon, Mar 29, 2021 at 9:13 AM Miao Wang 
>> wrote:
>>
>>> Congratulations Russell!
>>>
>>>
>>>
>>> Miao
>>>
>>>
>>>
>>> *From: *Szehon Ho 
>>> *Reply-To: *"dev@iceberg.apache.org" 
>>> *Date: *Monday, March 29, 2021 at 9:12 AM
>>> *To: *"dev@iceberg.apache.org" 
>>> *Subject: *Re: Welcoming Russell Spitzer as a new committer
>>>
>>>
>>>
>>> Awesome, well-deserved, Russell!
>>>
>>>
>>>
>>> Szehon
>>>
>>>
>>>
>>> On 29 Mar 2021, at 18:10, Holden Karau  wrote:
>>>
>>>
>>>
>>> Congratulations Russell!
>>>
>>>
>>>
>>> On Mon, Mar 29, 2021 at 9:10 AM Anton Okolnychyi <
>>> aokolnyc...@apple.com.invalid> wrote:
>>>
>>> Hey folks,
>>>
>>> I’d like to welcome Russell Spitzer as a new committer to the
>>> project!
>>>
>>> Thanks for all your contributions, Russell!
>>>
>>> - Anton
>>>
>>> --
>>>
>>> Twitter: https://twitter.com/holdenkarau
>>> 
>>>
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> 
>>>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>> 
>>>
>>>
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

 --
 Best Regards

>>>

-- 
Edgar R


Re: Nessie PRs

2021-03-15 Thread Edgar Rodriguez
Thanks for taking a look!

Best,

On Mon, Mar 15, 2021 at 7:36 PM Daniel Weeks 
wrote:

> Looks like things are healthy again after just restarting the CI run.
> Thanks for pointing that out Edgar!
>
> On Mon, Mar 15, 2021 at 1:39 PM Daniel Weeks  wrote:
>
>> I saw a number of artifacts that looked unrelated to the change that
>> couldn't be resolved and assumed it was a temporary issue.
>>
>> I'll see if I can kick it off again (or Ryan M. if you have a chance to
>> look at the error, that would help).
>>
>> On Mon, Mar 15, 2021 at 12:23 PM Edgar Rodriguez
>>  wrote:
>>
>>> FYI I think https://github.com/apache/iceberg/pull/2307 broke the CI,
>>> I'm seeing the following errors:
>>>
>>> FAILURE: Build failed with an exception.
>>>
>>> * What went wrong:
>>> A problem occurred configuring root project 'iceberg'.
>>> > Could not resolve all artifacts for configuration ':classpath'.
>>>    > Could not resolve org.projectnessie:org.projectnessie.gradle.plugin:0.4.0.
>>>      Required by:
>>>          project :
>>>    > Could not resolve org.projectnessie:org.projectnessie.gradle.plugin:0.4.0.
>>>       > Could not get resource 'https://jcenter.bintray.com/org/projectnessie/org.projectnessie.gradle.plugin/0.4.0/org.projectnessie.gradle.plugin-0.4.0.pom'.
>>>          > Could not HEAD 'https://jcenter.bintray.com/org/projectnessie/org.projectnessie.gradle.plugin/0.4.0/org.projectnessie.gradle.plugin-0.4.0.pom'.
>>>             > Read timed out
>>>
>>> Cheers,
>>>
>>> On Mon, Mar 15, 2021 at 2:54 PM Ryan Murray  wrote:
>>>
>>>> Thanks a lot Dan!
>>>>
>>>> On Mon, Mar 15, 2021 at 4:58 PM Daniel Weeks 
>>>> wrote:
>>>>
>>>>> Hey Ryan,  I just took care of the first one and I might have some
>>>>> time to look over the others today or tomorrow (unless someone else gets 
>>>>> to
>>>>> them first).
>>>>>
>>>>> -Dan
>>>>>
>>>>> On Mon, Mar 15, 2021 at 7:03 AM Ryan Murray  wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> I have a few Nessie PRs that I am hoping to try and get merged. Is
>>>>>> there a committer who has a bit of free time? Happy to help w/ some of 
>>>>>> the
>>>>>> details if the PRs aren't clear. I am @rymurr on the apache slack if you
>>>>>> want a quick response.
>>>>>>
>>>>>> The nessie version bump is the most pressing and the easiest:
>>>>>> https://github.com/apache/iceberg/pull/2307
>>>>>> The PRs are under the nessie label:
>>>>>>
>>>>>> https://github.com/apache/iceberg/pulls?q=is%3Aopen+is%3Apr+label%3ANESSIE
>>>>>> and one doc PR:
>>>>>> https://github.com/apache/iceberg/pull/2292
>>>>>>
>>>>>> Best,
>>>>>> Ryan
>>>>>>
>>>>>
>>>
>>> --
>>> Edgar R
>>>
>>

-- 
Edgar R


Re: Nessie PRs

2021-03-15 Thread Edgar Rodriguez
FYI I think https://github.com/apache/iceberg/pull/2307 broke the CI, I'm
seeing the following errors:

FAILURE: Build failed with an exception.

* What went wrong:
A problem occurred configuring root project 'iceberg'.
> Could not resolve all artifacts for configuration ':classpath'.
   > Could not resolve org.projectnessie:org.projectnessie.gradle.plugin:0.4.0.
     Required by:
         project :
   > Could not resolve org.projectnessie:org.projectnessie.gradle.plugin:0.4.0.
      > Could not get resource 'https://jcenter.bintray.com/org/projectnessie/org.projectnessie.gradle.plugin/0.4.0/org.projectnessie.gradle.plugin-0.4.0.pom'.
         > Could not HEAD 'https://jcenter.bintray.com/org/projectnessie/org.projectnessie.gradle.plugin/0.4.0/org.projectnessie.gradle.plugin-0.4.0.pom'.
            > Read timed out

Cheers,

On Mon, Mar 15, 2021 at 2:54 PM Ryan Murray  wrote:

> Thanks a lot Dan!
>
> On Mon, Mar 15, 2021 at 4:58 PM Daniel Weeks 
> wrote:
>
>> Hey Ryan,  I just took care of the first one and I might have some time
>> to look over the others today or tomorrow (unless someone else gets to them
>> first).
>>
>> -Dan
>>
>> On Mon, Mar 15, 2021 at 7:03 AM Ryan Murray  wrote:
>>
>>> Hi All,
>>>
>>> I have a few Nessie PRs that I am hoping to try and get merged. Is there
>>> a committer who has a bit of free time? Happy to help w/ some of the
>>> details if the PRs aren't clear. I am @rymurr on the apache slack if you
>>> want a quick response.
>>>
>>> The nessie version bump is the most pressing and the easiest:
>>> https://github.com/apache/iceberg/pull/2307
>>> The PRs are under the nessie label:
>>>
>>> https://github.com/apache/iceberg/pulls?q=is%3Aopen+is%3Apr+label%3ANESSIE
>>> and one doc PR:
>>> https://github.com/apache/iceberg/pull/2292
>>>
>>> Best,
>>> Ryan
>>>
>>

-- 
Edgar R


Re: Migrating legacy snapshot daily Hive table concept to Iceberg

2021-03-15 Thread Edgar Rodriguez
Hi Ryan,

On Tue, Mar 9, 2021 at 5:54 AM Ryan Murray  wrote:

> Hey Edgar, Cheng Pan,
>
> I am not sure if you are aware of project Nessie? It _may_ suit your
> needs. Nessie applies git-like functionality to Iceberg tables (most
> useful in this case are branches and tags).
>

Thanks for the suggestion. I did look at Nessie and it looks really cool.


>
> In effect you would be pivoting the snapshot partition into the table
> itself and using nessie tags to represent the previous table snapshots. You
> could create a tag for each database snapshot with the date the snapshot
> was taken and the`main` branch would then receive your half hour updates. I
> think the major issue is that you would lose the `ds` partition column and
> have to use the `select * from tablename@tagname` syntax that nessie
> supports to query a specific `ds`, however it would provide you with the
> `snapshot-tag` concept you suggested above. A potential extra benefit is
> that all tables would be under the same tag so you would in effect have the
> same tag for the set of tables rather than an iceberg snapshot id per table.
>

Yeah, I think the main issue with this workflow is either keeping the `ds`
way of querying the tables - while in Iceberg we try to hide partitioning
from the user - or making the migration less disruptive for users. For
instance, if Iceberg supported snapshot tags, we could use the Spark
procedure to set the current snapshot using a tag, which may be easier to
use than the snapshot-id it currently expects.

In general, I think it's a bit hard to track snapshot-ids in Iceberg,
especially when referring to them later. Not all snapshot-ids are relevant
to users, and an external mapping would be needed to track the specific
points of interest. While they lack the full feature set of Nessie,
snapshot-tags by themselves would still be useful for folks using their
own catalogs/tools or vanilla Iceberg.
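
To make this concrete, a rough sketch of resolving such a tag today via a
custom snapshot summary property (the `snapshot-tag` key, table name, and
the `catalog`/`spark` variables are illustrative - this is a convention,
not an existing Iceberg feature):

    import java.util.stream.StreamSupport;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    Table table = catalog.loadTable(TableIdentifier.of("db", "events"));
    // Resolve the "tag" to a snapshot id by scanning snapshot summaries.
    long snapshotId = StreamSupport.stream(table.snapshots().spliterator(), false)
        .filter(s -> "ds=2021-01-01".equals(s.summary().get("snapshot-tag")))
        .findFirst()
        .orElseThrow(() -> new IllegalArgumentException("tag not found"))
        .snapshotId();
    // Time travel to it with the standard snapshot-id read option.
    Dataset<Row> df = spark.read()
        .format("iceberg")
        .option("snapshot-id", Long.toString(snapshotId))
        .load("db.events");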

Thanks,
-- 
Edgar R


Re: Hive query with join of Iceberg table and Hive table

2021-03-12 Thread Edgar Rodriguez
Great! Thanks Peter for letting me know. I'll take a look.

Best,

On Fri, Mar 12, 2021 at 10:14 AM Peter Vary 
wrote:

> Hi Edgar,
>
> You might want to take a look at this:
> https://github.com/apache/iceberg/pull/2329
>
> The PR aims to update Hive table statistics to the HiveCatalog when any
> change to the table is committed. This solves the issue with the upstream
> Hive code, and might solve the issue with other versions as well.
>
> Thanks,
> Peter
>
> On Mar 4, 2021, at 09:43, Vivekanand Vellanki  wrote:
>
> Our concern is not specific to Iceberg. I am concerned about the memory
> requirement in caching a large number of splits.
>
> With Iceberg, estimating row counts when the query has predicates requires
> scanning the manifest list and manifest files to identify all the data
> files; and compute the row count estimates. While it is reasonable to cache
> these splits to avoid reading the manifest files twice, this increases the
> memory requirement. Also, query engines might want to handle row count
> estimation and split generation phases - row count estimation is required
> for the cost based optimiser. Split generation can be done in parallel by
> reading manifest files in parallel.
>
> It would be good to decouple row count estimation from split generation.
>
> On Wed, Mar 3, 2021 at 11:24 PM Ryan Blue 
> wrote:
>
>> I agree with the concern about caching splits, but doesn't the API cause
>> us to collect all of the splits into memory anyway? I thought there was no
>> way to return splits as an `Iterator` that lazily loads them. If that's the
>> case, then we primarily need to worry about cleanup and how long they are
>> kept around.
>>
>> I think it is also fairly reasonable to do the planning twice to avoid
>> the problem in Hive. Spark distributes the responsibility to each driver,
>> so jobs are separate and don't affect one another. If this is happening on
>> a shared Hive server endpoint then we probably have more of a concern about
>> memory consumption.
>>
>> Vivekanand, can you share more detail about how/where this is happening
>> in your case?
>>
>> On Wed, Mar 3, 2021 at 7:53 AM Edgar Rodriguez <
>> edgar.rodrig...@airbnb.com.invalid> wrote:
>>
>>> On Wed, Mar 3, 2021 at 1:48 AM Peter Vary 
>>> wrote:
>>>
>>>> Quick question @Edgar: Am I right that the table is created by Spark? I
>>>> think if it is created from Hive and we inserted the data from Hive, then
>>>> we should have the basic stats already collected and we should not need the
>>>> estimation (we might still do it, but probably we should not)
>>>>
>>>
>>> Yes, Spark creates the table. We don't write Iceberg tables with Hive.
>>>
>>>
>>>>
>>>> Also we should check if Hive expects the full size of the table, or the
>>>> size of the table after filters. If Hive collects this data by file
>>>> scanning I would expect that it would be adequate to start with unfiltered
>>>> raw size.
>>>>
>>>
>>> In this case Hive is performing the FS scan to find the raw size of the
>>> location to query - since the table is unpartitioned (ICEBERG type) and
>>> Hive is not aware of Iceberg metadata, the location to query is the full
>>> table. However, if the estimator is used, it is passed a
>>> TableScanOperator, which I assume could be used to gather some specific
>>> stats if present in the operator.
>>>
>>>
>>>>
>>>> Thanks,
>>>> Peter
>>>>
>>>>
>>>> Vivekanand Vellanki wrote (on Wed, Mar 3, 2021, 5:15 AM):
>>>>
>>>>> One of our concerns with caching the splits is the amount of memory
>>>>> required for this. If the filtering is not very selective and the table
>>>>> happens to be large, this increases the memory requirement to hold all the
>>>>> splits in memory.
>>>>>
>>>>
>>> I agree with this - caching the splits would be a concern for memory
>>> consumption; even now, serializing/deserializing splits in Hive (probably
>>> another topic for discussion) takes considerable time for a query
>>> producing ~3.5K splits.
>>>
>>> Cheers,
>>> --
>>> Edgar R
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>

-- 
Edgar R


Migrating legacy snapshot daily Hive table concept to Iceberg

2021-03-08 Thread Edgar Rodriguez
Hi folks,

I’d like to request some feedback on how to use Iceberg to approach a use
case we have - one that I believe other folks could be facing, since this
was a common pattern with Hive tables.

Use case:
1. We used to have database table snapshots exported daily at 0 UTC. Each
day a new partition is created with a materialized snapshot (e.g.
ds=2021-02-01, ds=2021-02-02, ...)
2. We have a lot of queries written against this legacy structure.
3. We would like to start migrating to Iceberg by writing a table snapshot
and periodically committing mutations (e.g. every half hour).
4. We are trying to retain the legacy interface (`ds` partition as a
snapshot) to support the myriad of existing queries, which sometimes target
multiple snapshots at the same time so that old queries continue to work,
while new queries are written directly against Iceberg tables using time
travel.

Issues:
An issue I see moving this use case to Iceberg is the interface: many users
already have queries that use the `ds` partitioning column to select a
snapshot - note also that with this approach users NEED to know they can
only query these tables with a `ds` filter, otherwise they could get
duplicated rows. One thought we had to solve this was to use a thin
wrapper - for instance, in Hive, a custom table InputFormat that takes the
filter expression (with the `ds`) and maps it to a snapshot using a JSON
config file (which holds the snapshot-id-to-ds mapping) - and something
similar for Spark. This solution is very custom to the use case and makes
a lot of assumptions, but the idea is to present this specific interface
to users while using Iceberg underneath - a transitional phase until user
queries are fully migrated to using snapshots directly.

I still think Iceberg would be a good candidate to avoid duplicating data
and to remove the requirement that users know the partitioning and its
implied meaning before querying the table.

How are other folks with the same use case solving this with Iceberg?



On Iceberg snapshots:
I know that in Iceberg we want to abstract partitioning away from the user
as much as possible, since this is really powerful. My initial thought is
to use Iceberg's natively supported table snapshots and time travel.
However, it’s not straightforward for users to work with a snapshot-id,
and snapshots may not correspond exactly to the data at a given timestamp,
only to the point when the change was applied to the table. For example,
if I want the table data for 2021-01-01 00:00:00 UTC but the commit for
that particular cut-over happened at 2021-01-01 06:00:00 UTC, then using a
timestamp is not straightforward either.
Would it make sense to introduce a `snapshot-tag` concept that could be
used to refer to a particular snapshot? I guess we could add it to the
snapshot summary, but there’s currently no way to use such a tag instead
of the ID to refer to the snapshot. This would allow us to tag specific
snapshots and let users query by tag, simplifying the migration a bit.
We’d also need to make sure the tags are unique, same as the snapshot IDs.
In a way, I think of this as something similar to Git, where the
snapshot-id is akin to a commit hash and a snapshot-tag is similar to a
git tag. I think this would simplify the way snapshots are used in queries.
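
A rough sketch of the write side, stashing a tag in the snapshot summary at
commit time (the `snapshot-tag` key is a made-up convention, not an
existing feature, and `dataFile` is prepared elsewhere):

    table.newAppend()
        .appendFile(dataFile)                  // DataFile built elsewhere
        .set("snapshot-tag", "ds=2021-01-01")  // custom summary property
        .commit();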

I’m happy to hear other approaches. Thanks for reading and the comments in
advance!

Best,
-- 
Edgar R


Re: Hive query with join of Iceberg table and Hive table

2021-03-03 Thread Edgar Rodriguez
On Wed, Mar 3, 2021 at 1:48 AM Peter Vary 
wrote:

> Quick question @Edgar: Am I right that the table is created by Spark? I
> think if it is created from Hive and we inserted the data from Hive, then
> we should have the basic stats already collected and we should not need the
> estimation (we might still do it, but probably we should not)
>

Yes, Spark creates the table. We don't write Iceberg tables with Hive.


>
> Also we should check if Hive expects the full size of the table, or the
> size of the table after filters. If Hive collects this data by file
> scanning I would expect that it would be adequate to start with unfiltered
> raw size.
>

In this case Hive is performing the FS scan to find the raw size of the
location to query - since the table is unpartitioned (ICEBERG type) and
Hive is not aware of Iceberg metadata, the location to query is the full
table. However, if the estimator is used, it is passed a
TableScanOperator, which I assume could be used to gather some specific
stats if present in the operator.


>
> Thanks,
> Peter
>
>
> Vivekanand Vellanki wrote (on Wed, Mar 3, 2021, 5:15 AM):
>
>> One of our concerns with caching the splits is the amount of memory
>> required for this. If the filtering is not very selective and the table
>> happens to be large, this increases the memory requirement to hold all the
>> splits in memory.
>>
>
I agree with this - caching the splits would be a concern for memory
consumption; even now, serializing/deserializing splits in Hive (probably
another topic for discussion) takes considerable time for a query
producing ~3.5K splits.

Cheers,
-- 
Edgar R


Re: Hive query with join of Iceberg table and Hive table

2021-03-02 Thread Edgar Rodriguez
After a bit of further digging, I found that the issue is related to Hive
trying to estimate the input size of the Iceberg table for the join at
query planning time. Since HiveIcebergStorageHandler does not implement
InputEstimator
<https://github.com/apache/hive/blob/branch-2.3/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L2195>,
Hive tries to estimate the input size the same way it would for a
native Hive table: by scanning the FS, listing the paths recursively
<https://github.com/apache/hadoop/blob/branch-2.8.5/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L1489>,
and adding up the file lengths - for Iceberg tables it starts scanning
from the table location, since the table is EXTERNAL and unpartitioned -
as mentioned in the Hive Wiki
<https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=82903061#ConfigurationProperties-hive.fetch.task.conversion.threshold>:

> If target table is native, input length is calculated by summation of
> file lengths. If it's not native, the storage handler for the table can
> optionally implement the org.apache.hadoop.hive.ql.metadata.InputEstimator
> interface.


After adding the interface to the storage handler and providing an
implementation that returns an Estimation(-1, -1)
<https://hive.apache.org/javadocs/r2.1.1/api/org/apache/hadoop/hive/ql/metadata/InputEstimator.Estimation.html#Estimation-int-long->,
the query completes in the expected amount of time - a better
implementation could compute an actual estimate. I assume this is only an
issue when the underlying FS tree of the Iceberg table is large and
traversing it takes a long time; otherwise Hive would finish the FS
traversal and the query would make progress.
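
A minimal sketch of what I have in mind (illustrative only - the real
storage handler implements more interfaces, and a real implementation
could derive estimates from Iceberg metadata instead):

    import org.apache.hadoop.hive.ql.exec.TableScanOperator;
    import org.apache.hadoop.hive.ql.metadata.HiveException;
    import org.apache.hadoop.hive.ql.metadata.InputEstimator;
    import org.apache.hadoop.mapred.JobConf;

    public class HiveIcebergStorageHandler /* extends ... */ implements InputEstimator {
      @Override
      public Estimation estimate(JobConf job, TableScanOperator ts, long remaining)
          throws HiveException {
        // (-1, -1) signals "unknown size", so Hive skips the recursive FS listing.
        return new Estimation(-1, -1);
      }
    }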

Should we make this change in the HiveIcebergStorageHandler?

Cheers,

On Tue, Mar 2, 2021 at 1:11 PM Peter Vary 
wrote:

> I have seen this kind of problem when the catalog was not configured for
> the table/session and we ended up using the default catalog instead of
> HiveCatalog
>
> On Mar 2, 2021, at 18:49, Edgar Rodriguez <
> edgar.rodrig...@airbnb.com.INVALID> wrote:
>
> Hi,
>
> I'm trying to run a simple query in Hive 2.3.4 with a join of a Hive table
> and an Iceberg table, each configured accordingly - Iceberg table has the
> `storage_handler` defined and running with MR engine.
>
> I'm using the `iceberg.mr.catalog.loader.class` class to load our internal
> catalog. In the logs I can see Hive loading the Iceberg table, but then I
> can see the Driver doing some traversal through the FS path under the table
> location, getting statuses for all data within the directory - this is not
> the behavior I see when querying an Iceberg table in Hive by itself, where
> I can see the splits being computed correctly.
> Due to this behavior, the query basically scans the full FS structure
> under the path - which, if large, makes it look stuck, although I do see
> wire activity fetching the FS listings.
>
> Question is, has anyone experienced this behavior on querying Hive tables
> with joins on Iceberg tables? If so, what's the best way to approach this?
>
> Best,
> --
> Edgar R
>
>
>

-- 
Edgar R


Hive query with join of Iceberg table and Hive table

2021-03-02 Thread Edgar Rodriguez
Hi,

I'm trying to run a simple query in Hive 2.3.4 with a join of a Hive table
and an Iceberg table, each configured accordingly - Iceberg table has the
`storage_handler` defined and running with MR engine.

I'm using the `iceberg.mr.catalog.loader.class` class to load our internal
catalog. In the logs I can see Hive loading the Iceberg table, but then I
can see the Driver doing some traversal through the FS path under the table
location, getting statuses for all data within the directory - this is not
the behavior I see when querying an Iceberg table in Hive by itself, where
I can see the splits being computed correctly.
Due to this behavior, the query basically scans the full FS structure under
the path - which, if large, makes it look stuck, although I do see the
wire activity fetching the FS listings.

Question is, has anyone experienced this behavior on querying Hive tables
with joins on Iceberg tables? If so, what's the best way to approach this?

Best,
-- 
Edgar R


Re: Default TimeZone for unit tests

2021-03-01 Thread Edgar Rodriguez
Hi folks,

Thanks Peter for the quick fix!

I do think it'd be a good idea to have this kind of coverage to some
extent. A workflow some users follow is to run only the modules they
modify locally and rely on CI for the full check, which takes longer; that
leaves room for these issues to land in master until someone eventually
finds the broken test. However, I do agree that we probably should not
spend a large amount of time on this - ideally doing it in CI would be
great, e.g. having two CI jobs, one for UTC and another for a different TZ.
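
On the timestamp-specific suites: a minimal sketch of pinning a non-UTC
default zone for such a suite (assuming JUnit 4; the class name and zone
are illustrative):

    import java.util.TimeZone;
    import org.junit.AfterClass;
    import org.junit.BeforeClass;

    public class TestTimestampsNonUtc {
      private static TimeZone original;

      @BeforeClass
      public static void setNonUtcZone() {
        original = TimeZone.getDefault();
        TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"));
      }

      @AfterClass
      public static void restoreZone() {
        TimeZone.setDefault(original);
      }

      // ... timestamp tests run with the non-UTC default zone ...
    }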

Cheers,

On Mon, Mar 1, 2021 at 2:52 PM Ryan Blue  wrote:

> I'm not sure it would be worth separating out the timezone tests to do
> this. I think we catch these problems pretty quickly with the number of
> users building in different zones. Is this something we want to spend time
> on?
>
> On Mon, Mar 1, 2021 at 10:29 AM Russell Spitzer 
> wrote:
>
>> In the Spark Cassandra Connector we had a similar issue: we would
>> specifically spawn test JVMs with different default local time zones to
>> make sure we handled these use cases. I would also put our test dates on
>> Gregorian calendar boundaries, so that being an hour off would result in a
>> timestamp that was actually several days off, making the failure clear.
>>
>> So maybe it makes sense to break out some timestamp-specific tests and
>> have them run with different local timezones? Then you have UTC, PST, CEU,
>> or whatever test suites to run. If we scope this to just timestamp-specific
>> tests it shouldn't be that much more expensive, and I do think the coverage
>> is important.
>>
>> On Mon, Mar 1, 2021 at 12:25 PM Peter Vary 
>> wrote:
>>
>>> Hi Team,
>>>
>>> Last weekend I caused a bit of a stir by pushing changes which had
>>> a green run on CI but were failing locally if the default TZ was different
>>> from UTC.
>>>
>>> Do we want to set the TZ of the CI tests to some random non-UTC TZ to
>>> catch these errors?
>>>
>>> Pros:
>>>
>>>- We can catch tests which are only working in UTC
>>>
>>>
>>> Cons:
>>>
>>>- I think the typical TZ is UTC in our target environments, so
>>>catching UTC problems might be more important
>>>
>>>
>>> I am interested in your thoughts about this.
>>>
>>> Thanks,
>>> Peter
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Edgar R


Re: About importing Hive tables and name mapping

2020-11-05 Thread Edgar Rodriguez
Hi Xiang,

On Thu, Nov 5, 2020 at 11:07 AM 李响  wrote:

> Dear community:
>
> I am using SparkTableUtil to import an existing Hive table to an Iceberg
> table.
> The ORC files of the Hive table use an old version of ORC, so I set a name
> mapping (like: id 1 mapped to _col0 and id 2 mapped to _col1...) on the
> Iceberg table by using "schema.name-mapping.default", so that the metrics
> of the ORC files could be built correctly during the import process.
>
> After that, I plan to write new data into the Iceberg table (using the ORC
> version 1.6.5 in the iceberg package), how could I deal with that name
> mapping used for importing ? Should I remove that? Does that name mapping
> do any harm when reading/writing from/to the new ORC file?
>

If I understand correctly, the name mapping only applies when no Iceberg
IDs are found in the ORC file as type attributes, which is the case for
the imported data. All new data you write with Iceberg/ORC will have the
Iceberg field-id stored as a type attribute, so when reading those new
files the name mapping will have no effect, since the read path will
detect the Iceberg field-ids.
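
For reference, a rough sketch of how such a mapping can be attached to the
table (the field IDs and `_colN` names are illustrative; `table` is a
loaded org.apache.iceberg.Table):

    import org.apache.iceberg.TableProperties;
    import org.apache.iceberg.mapping.MappedField;
    import org.apache.iceberg.mapping.NameMapping;
    import org.apache.iceberg.mapping.NameMappingParser;

    NameMapping mapping = NameMapping.of(
        MappedField.of(1, "_col0"),   // Iceberg field id 1 -> old ORC column name
        MappedField.of(2, "_col1"));
    table.updateProperties()
        .set(TableProperties.DEFAULT_NAME_MAPPING, NameMappingParser.toJson(mapping))
        .commit();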

Cheers,
-- 
Edgar R


Re: [VOTE] Release Apache Iceberg 0.10.0 RC4

2020-11-05 Thread Edgar Rodriguez
+1 non-binding for RC4. Tested with internal tests in a cluster; validated
Spark writes and Hive reads.

On Thu, Nov 5, 2020 at 5:56 AM Mass Dosage  wrote:

> +1 non-binding on RC4. I tested out the Hive read path on a distributed
> cluster using HadoopTables.
>
> On Thu, 5 Nov 2020 at 04:46, Dongjoon Hyun 
> wrote:
>
>> +1 for 0.10.0 RC4.
>>
>> Bests,
>> Dongjoon.
>>
>> On Wed, Nov 4, 2020 at 7:17 PM Jingsong Li 
>> wrote:
>>
>>> +1
>>>
>>> 1. Download the source tarball, signature (.asc), and checksum
>>> (.sha512):   OK
>>> 2. Import gpg keys: download KEYS and run gpg --import
>>> /path/to/downloaded/KEYS (optional if this hasn’t changed) :  OK
>>> 3. Verify the signature by running: gpg --verify
>>> apache-iceberg-xx.tar.gz.asc:  OK
>>> 4. Verify the checksum by running: sha512sum -c
>>> apache-iceberg-xx.tar.gz.sha512 :  OK
>>> 5. Untar the archive and go into the source directory: tar xzf
>>> apache-iceberg-xx.tar.gz && cd apache-iceberg-xx:  OK
>>> 6. Run RAT checks to validate license headers: dev/check-license: OK
>>> 7. Build and test the project: ./gradlew build (use Java 8) :   OK
>>>
>>> Best,
>>> Jingsong
>>>
>>> On Thu, Nov 5, 2020 at 7:38 AM Ryan Blue 
>>> wrote:
>>>
 +1

- Validated checksum and signature
- Ran license checks
- Built and ran tests
- Queried a Hadoop FS table created with 0.9.0 in Spark 3.0.1
- Created a Hive table from Spark 3.0.1
- Tested metadata tables from Spark
- Tested Hive and Hadoop table reads in Hive 2.3.7

 I was able to read both Hadoop and Hive tables created in Spark from
 Hive using:

 add jar /home/blue/Downloads/iceberg-hive-runtime-0.10.0.jar;
 create external table hadoop_table
   stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
   location 'file:/home/blue/tmp/hadoop-warehouse/default/test';
 select * from hadoop_table;

 set iceberg.mr.catalog=hive;
 select * from hive_table;

 The hive_table needed engine.hive.enabled=true set in table properties
 by Spark using:

 alter table hive_table set tblproperties ('engine.hive.enabled'='true')

 Hive couldn’t read the #snapshots metadata table for Hadoop. It failed
 with this error:

 Failed with exception 
 java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: 
 java.lang.ClassCastException: java.lang.Long cannot be cast to 
 java.time.OffsetDateTime

 I also couldn’t read the Hadoop table once iceberg.mr.catalog was set
 in my environment, so I think we have a bit more work to do to clean up
 Hive table configuration.

 On Wed, Nov 4, 2020 at 12:54 AM Ryan Murray  wrote:

> +1 (non-binding)
>
> 1. Download the source tarball, signature (.asc), and checksum
> (.sha512):   OK
> 2. Import gpg keys: download KEYS and run gpg --import
> /path/to/downloaded/KEYS (optional if this hasn’t changed) :  OK
> 3. Verify the signature by running: gpg --verify
> apache-iceberg-xx.tar.gz.asc:  I got a warning "gpg: WARNING: This key is
> not certified with a trusted signature! gpg:  There is no
> indication that the signature belongs to the owner." but it passed
> 4. Verify the checksum by running: sha512sum -c
> apache-iceberg-xx.tar.gz.sha512 :  OK
> 5. Untar the archive and go into the source directory: tar xzf
> apache-iceberg-xx.tar.gz && cd apache-iceberg-xx:  OK
> 6. Run RAT checks to validate license headers: dev/check-license: OK
> 7. Build and test the project: ./gradlew build (use Java 8 & Java 11)
> :   OK
>
>
> On Wed, Nov 4, 2020 at 2:56 AM OpenInx  wrote:
>
>> +1 for 0.10.0 RC4
>>
>> 1. Download the source tarball, signature (.asc), and checksum
>> (.sha512):   OK
>> 2. Import gpg keys: download KEYS and run gpg --import
>> /path/to/downloaded/KEYS (optional if this hasn’t changed) :  OK
>> 3. Verify the signature by running: gpg --verify
>> apache-iceberg-xx.tar.gz.asc:  OK
>> 4. Verify the checksum by running: sha512sum -c
>> apache-iceberg-xx.tar.gz.sha512 :  OK
>> 5. Untar the archive and go into the source directory: tar xzf
>> apache-iceberg-xx.tar.gz && cd apache-iceberg-xx:  OK
>> 6. Run RAT checks to validate license headers: dev/check-license: OK
>> 7. Build and test the project: ./gradlew build (use Java 8) :   OK
>>
>> On Wed, Nov 4, 2020 at 8:25 AM Anton Okolnychyi
>>  wrote:
>>
>>> Hi everyone,
>>>
>>> I propose the following RC to be released as official Apache Iceberg
>>> 0.10.0 release.
>>>
>>> The commit id is d39fad00b7dded98121368309f381473ec21e85f
>>> * This corresponds to the tag: apache-iceberg-0.10.0-rc4
>>> *
>>> https://github.com/apache/iceberg/commits/apache-iceberg-0.10.0-rc4
>>> *
>>> https://github.com/apache/iceberg/tree/d39fad00b7dded98121368309f381473ec

Re: Hive Iceberg writes

2020-08-27 Thread Edgar Rodriguez
Hi folks,

We have not started work on this either, but we've discussed internally
whether to support Hive writes or not. Our first priority right now is
getting Hive reads into production to have read compatibility with our
existing Hive clients. We'd be interested in this; however, at Airbnb
we're moving to Spark, so Hive writes most likely won't be at the top of
our list.

Thanks!

Cheers,

On Thu, Aug 27, 2020 at 12:53 AM Mass Dosage  wrote:

> We're definitely interested in this too but haven't started work on it
> yet. It has been discussed at our community syncs as something quite a few
> people are interested in so if nobody else responds a good starting point
> would probably be an early WIP PR that everyone can follow and contribute
> to.
>
> Thanks,
>
> Adrian
>
> On Wed, 26 Aug 2020 at 17:35, Ryan Blue  wrote:
>
>> I think Edgar and Adrien who have been contributing support for ORC and
>> Hive are interested in this as well.
>>
>> On Wed, Aug 26, 2020 at 9:22 AM Peter Vary 
>> wrote:
>>
>>> Hi Team,
>>>
>>> We are thinking about implementing HiveOutputFormat, so writes through
>>> Hive can work as well.
>>> Has anybody working on this? Do you know any ongoing effort related to
>>> Hive writes?
>>> Asking because we would like to prevent duplicate effort.
>>> Also if anyone has some good pointers to start for an Iceberg noobie, it
>>> would be good.
>>>
>>> Thanks,
>>> Peter
>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Edgar R


Re: [VOTE] Release Apache Iceberg 0.9.0 RC5

2020-07-13 Thread Edgar Rodriguez
+1 (non-binding)

- Verified signatures
- Verified checksum
- Build from src tarball and ran tests
- Ran internal test suite, they pass

On Mon, Jul 13, 2020 at 11:46 AM Pavan Lanka 
wrote:

> +1 (non-binding)
>
>
>- Environment
>   - OSX
>   - openjdk 1.8.0_252
>- Build from source with tests
>   - Build time ~7mins
>   - Except for some warnings looks good
>
>
>
> On Jul 10, 2020, at 9:20 AM, Ryan Murray  wrote:
>
> 1. Verify the signature: OK
> 2. Verify the checksum: OK
> 3. Untar the archive tarball: OK
> 4. Run RAT checks to validate license headers: RAT checks passed
> 5. Build and test the project: all unit tests passed.
>
> +1 (non-binding)
>
> I did see that my build took >12 minutes and used 100% of all 8 cores
> & 32GB of memory (openjdk-8, Ubuntu 18.04), which I hadn't noticed before.
> On Fri, Jul 10, 2020 at 4:37 AM OpenInx  wrote:
>
>> I followed the verify guide here (
>> https://lists.apache.org/thread.html/rd5e6b1656ac80252a9a7d473b36b6227da91d07d86d4ba4bee10df66%40%3Cdev.iceberg.apache.org%3E)
>> :
>>
>> 1. Verify the signature: OK
>> 2. Verify the checksum: OK
>> 3. Untar the archive tarball: OK
>> 4. Run RAT checks to validate license headers: RAT checks passed
>> 5. Build and test the project: all unit tests passed.
>>
>> +1 (non-binding).
>>
>> On Fri, Jul 10, 2020 at 9:46 AM Ryan Blue 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> I propose the following RC to be released as the official Apache Iceberg
>>> 0.9.0 release.
>>>
>>> The commit id is 4e66b4c10603e762129bc398146e02d21689e6dd
>>> * This corresponds to the tag: apache-iceberg-0.9.0-rc5
>>> * https://github.com/apache/iceberg/commits/apache-iceberg-0.9.0-rc5
>>> * https://github.com/apache/iceberg/tree/4e66b4c1
>>>
>>> The release tarball, signature, and checksums are here:
>>> *
>>> https://dist.apache.org/repos/dist/dev/iceberg/apache-iceberg-0.9.0-rc5/
>>>
>>> You can find the KEYS file here:
>>> * https://dist.apache.org/repos/dist/dev/iceberg/KEYS
>>>
>>> Convenience binary artifacts are staged in Nexus. The Maven repository
>>> URL is:
>>> *
>>> https://repository.apache.org/content/repositories/orgapacheiceberg-1008/
>>>
>>> This release includes support for Spark 3 and vectorized reads for flat
>>> schemas in Spark.
>>>
>>> Please download, verify, and test.
>>>
>>> Please vote in the next 72 hours.
>>>
>>> [ ] +1 Release this as Apache Iceberg 0.9.0
>>> [ ] +0
>>> [ ] -1 Do not release this because...
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>

-- 
Edgar R


Re: failing tests on master

2020-06-26 Thread Edgar Rodriguez
There's already a fix for this in
https://github.com/apache/iceberg/pull/1127

Cheers,

On Fri, Jun 26, 2020 at 5:26 AM Mass Dosage  wrote:

> Hello all,
>
> For the past week or so I've noticed failing builds on a local checkout of
> master.
>
> I have raised an issue here:
>
> https://github.com/apache/iceberg/issues/1113
>
> (there was initially one failing test, there are now two)
>
> Someone else raised a similar issue with one of the same failing tests and
> then another one:
>
> https://github.com/apache/iceberg/issues/1116
>
> What can we do in order to get these 3 failing tests resolved? I think the
> first step should be to get the Travis build to fail on them so it's clear
> to everyone that there is a problem here and a failing master build should
> be #1 priority to resolve. Perhaps the Travis build could be changed to use
> a different timezone? I generally recommend using a timezone, charset etc.
> in the CI system that is different to whatever most of the developers have
> as their default in order to catch these kinds of issues.
>
> This is making it really hard to develop with confidence as one can't tell
> whether failing tests are due to changes I am making or not.
>
> Thanks,
>
> Adrian
>


-- 
Edgar R


Re: S3 example in Java

2020-06-24 Thread Edgar Rodriguez
Hi Chen,

Since S3 does not have an atomic rename operation, currently the only way
to create/write/read tables in S3 is via the HiveCatalog implementation,
which requires a Hive metastore with lock support to provide the atomic
commit that Iceberg requires. Alternatively, you can write a custom
Catalog implementation with your own atomic commit mechanism, as shown in
http://iceberg.apache.org/custom-catalog/.
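
A minimal sketch of the HiveCatalog route (the metastore URI, warehouse
path, schema, and names are all illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.PartitionSpec;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.hive.HiveCatalog;
    import org.apache.iceberg.types.Types;

    Configuration conf = new Configuration();
    conf.set("hive.metastore.uris", "thrift://metastore-host:9083");
    HiveCatalog catalog = new HiveCatalog(conf);

    Schema schema = new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()),
        Types.NestedField.optional(2, "data", Types.StringType.get()));

    // The table location points at S3; commits go through metastore locks.
    Table table = catalog.createTable(
        TableIdentifier.of("db", "events"),
        schema,
        PartitionSpec.unpartitioned(),
        "s3a://bucket/warehouse/db/events",
        null /* no extra table properties */);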

Cheers,

On Wed, Jun 24, 2020 at 6:12 AM Chen Song  wrote:

> Hi
>
> Are there any Java examples to create/write/read tables backed by S3? I
> tried to search in the documentation and github but did not find anything.
>
> Thanks
> Chen
>
>

-- 
Edgar R


Re: Shall we start a regular community sync up?

2020-06-15 Thread Edgar Rodriguez
Hi Ryan,

I'd like to attend the regular community syncs, could you send me an invite?

Thanks!

- Edgar

On Wed, Mar 25, 2020 at 6:38 PM Ryan Blue  wrote:

> Will do.
>
> On Wed, Mar 25, 2020 at 6:36 PM Jun Ma  wrote:
>
>> Hi Ryan,
>>
>> Thanks for driving the sync up meeting. Could you please add Fan Diao(
>> fan.dia...@gmail.com) and myself to the invitation list?
>>
>> Thanks,
>> Jun Ma
>>
>> On Mon, Mar 23, 2020 at 9:57 PM OpenInx  wrote:
>>
>>> Hi Ryan
>>>
>>> I received your invitation. Some folks from our Flink teams also want to
>>> join the Hangouts meeting. Do we need to
>>> send them an extra invitation, or can they just join the
>>> meeting by entering the meeting address[1]?
>>>
>>> If need so, please let the following guys in:
>>> 1. ykt...@gmail.com
>>> 2. imj...@gmail.com
>>> 3. yuzhao@gmail.com
>>>
>>> BTW, I've written a draft to discuss in the meeting [2]; anyone can
>>> add the topics they want to discuss.
>>>
>>> Thanks.
>>>
>>> [1]. https://meet.google.com/_meet/xdx-rknm-uvm
>>> [2].
>>> https://docs.google.com/document/d/1wXTHGYhc7sDhP5DxlByba0S5YguNLWwY98FAp6Tx2mw/edit#
>>>
>>> On Mon, Mar 23, 2020 at 5:35 AM Ryan Blue 
>>> wrote:
>>>
 I invited everyone that replied to this thread and the people that were
 on the last invite.

 If you have specific topics you'd like to put on the agenda, please
 send them to me!

 On Sun, Mar 22, 2020 at 2:28 PM Ryan Blue  wrote:

> Let's go with Wednesday. I'll send out an invite.
>
> On Sun, Mar 22, 2020 at 1:36 PM John Zhuge  wrote:
>
>> 5-5:30 pm work for me. Prefer Wednesdays.
>>
>> On Sun, Mar 22, 2020 at 1:33 PM Romin Parekh 
>> wrote:
>>
>>> Hi folks,
>>>
>>> Both times slots work for me next week. Can we confirm a day?
>>>
>>> Thanks,
>>> Romin
>>>
>>> Sent from my iPhone
>>>
>>> > On Mar 20, 2020, at 11:38 PM, Jun H.  wrote:
>>> >
>>> > The schedule works for me.
>>> >
>>> >> On Thu, Mar 19, 2020 at 6:55 PM Junjie Chen <
>>> chenjunjied...@gmail.com> wrote:
>>> >>
>>> >> The same time works for me as well.
>>> >>
>>> >>> On Fri, Mar 20, 2020 at 9:43 AM Gautam 
>>> wrote:
>>> >>>
>>> >>> 5 / 5:30pm any day of next week works for me.
>>> >>>
>>> >>> On Thu, Mar 19, 2020 at 6:07 PM 李响  wrote:
>>> 
>>>  5 or 5:30 PM (UTC-7, is it PDT now) in any day works for me.
>>> Looking forward to it 8-)
>>> 
>>>  On Fri, Mar 20, 2020 at 8:17 AM RD  wrote:
>>> >
>>> > Same time works for me too!
>>> >
>>> > On Thu, Mar 19, 2020 at 4:45 PM Xabriel Collazo Mojica
>>>  wrote:
>>> >>
>>> >> 5pm or 5:30pm PT  any day next week would work for me.
>>> >>
>>> >> Thanks for restoring the community sync up!
>>> >>
>>> >> Xabriel J Collazo Mojica  |  Sr Computer Scientist II  |
>>> Adobe
>>> >>
>>> >> On 3/18/20, 6:45 PM, "justin_cof...@apple.com on behalf of
>>> Justin Q Coffey" >> j...@apple.com.INVALID> wrote:
>>> >>
>>> >>Any chance we could actually do 5:30pm PST?  I'm a bit of
>>> a lurker, but this roadmap is important to mine and I have a daily at 
>>> 5pm
>>> :(.
>>> >>
>>> >>-Justin
>>> >>
>>> >>> On Mar 18, 2020, at 6:43 PM, Saisai Shao <
>>> sai.sai.s...@gmail.com> wrote:
>>> >>>
>>> >>> 5pm PST in any day works for me.
>>> >>>
>>> >>> Looking forward to it.
>>> >>>
>>> >>> Thanks
>>> >>> Saisai
>>> >>
>>> >>
>>> >>
>>> 
>>> 
>>>  --
>>> 
>>>    李响 Xiang Li
>>> 
>>>  手机 cellphone :+86-136-8113-8972
>>>  邮件 e-mail  :wate...@gmail.com
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Best Regards
>>>
>>
>>
>> --
>> John Zhuge
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


 --
 Ryan Blue
 Software Engineer
 Netflix

>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Edgar R


Re: New committer and PPMC member, Anton Okolnychyi

2019-08-30 Thread Edgar Rodriguez
Nice! Congratulations, Anton!

Cheers,

On Fri, Aug 30, 2019 at 1:42 PM Dongjoon Hyun 
wrote:

> Congratulations, Anton! :D
>
> Bests,
> Dongjoon.
>
> On Fri, Aug 30, 2019 at 10:06 AM Ryan Blue  wrote:
>
>> I'd like to congratulate Anton Okolnychyi, who was just invited to join
>> the Iceberg committers and PPMC!
>>
>> Thanks for all your contributions, Anton!
>>
>> rb
>>
>> --
>> Ryan Blue
>>
>

-- 
Edgar Rodriguez


Re: Spark version

2019-08-22 Thread Edgar Rodriguez
I've tested with 2.4.3 and the current Iceberg master branch works, at
least as of the last time I tested it.

Cheers,

On Thu, Aug 22, 2019 at 3:00 PM Ryan Blue  wrote:

> Iceberg should work with all of the 2.4.x releases. I don't think there
> have been any changes to the DSv2 API in patch releases.
>
> On Thu, Aug 22, 2019 at 2:51 PM RD  wrote:
>
>> Hi Iceberg devs,
>>We are in process of upgrading our Spark version to support Iceberg.
>> We wanted to know if Iceberg, would in the near term move to Spark 2.4.3?
>> If that's the case, we will do the work once and migrate to Spark 2.4.3
>> instead of Spark 2.4.0.
>>
>> -Best,
>>
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


-- 
Edgar Rodriguez


Re: Are we going to use Apache JIRA instead of Github issues

2019-08-19 Thread Edgar Rodriguez
I don't have a strong preference. I've worked with both of them, but I
appreciate the simplicity of GitHub and its fast search. I feel like JIRA
can become very complex quickly and requires a lot of labels, versions,
etc. to stay organized; GitHub may also need some of that, but it's
probably a bit simpler - there are fewer fields to fill out for a GitHub
issue.

I believe in both cases the issues need curation, so I'd favor simplicity.

Anyway, my two cents.
Thanks.

Cheers,
Edgar

On Sun, Aug 18, 2019 at 7:43 PM Saisai Shao  wrote:

>  The issue linking, Fix Version, and assignee features of JIRA are also
>> helpful communication and organization tools.
>>
>
> Yes, I think so. GitHub issues seem a little bit simple; there aren't so
> many statuses to track an issue unless we create a bunch of labels.
>
> Wes McKinney  于2019年8月17日周六 上午2:37写道:
>
>> One significant issue with GitHub issues for ASF projects is that
>> non-committers cannot edit issue or PR metadata (labels, requesting
>> reviews, etc). The lack of formalism around Resolved and Closed states can
>> place an extra communication burden to explain why an issue is closed.
>> Sometimes projects use GitHub labels like 'wontfix'. The issue linking, Fix
>> Version, and assignee features of JIRA are also helpful communication and
>> organization tools.
>>
>> In other projects I have found JIRA easier to keep a larger number of
>> people, release milestones, and issues organized. I can't imagine changing
>> to GitHub issues in Apache Arrow, for example
>>
>> On Fri, Aug 16, 2019, 1:19 PM Ryan Blue 
>> wrote:
>>
>>> I prefer to use github instead of JIRA because it is simpler and has
>>> better search (in my opinion). I'm just one vote, though, so if most people
>>> prefer to move to JIRA I'm open to it.
>>>
>>> What do you think is missing compared to JIRA?
>>>
>>> On Fri, Aug 16, 2019 at 3:09 AM Saisai Shao 
>>> wrote:
>>>
>>>> Hi Team,
>>>>
>>>> Seems Iceberg project uses Github issues instead of JIRA. IMHO JIRA is
>>>> more powerful and easy to manage, most of the Apache projects use JIRA to
>>>> track everything, any plan to move to JIRA or we stick on using Github
>>>> issues?
>>>>
>>>> Thanks
>>>> Saisai
>>>>
>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>

-- 
Edgar Rodriguez


Re: Iceberg in Spark 3.0.0

2019-08-08 Thread Edgar Rodriguez
On Thu, Aug 8, 2019 at 3:37 PM Ryan Blue  wrote:

> I think it's a great idea to branch and get ready for Spark 3.0.0. Right
> now, I'm focused on getting a release out, but I can review patches for
> Spark 3.0.
>
> Anyone know if there are nightly builds of Spark 3.0 to test with?
>

Seems like there are nightly snapshots built at
https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-sql_2.12/3.0.0-SNAPSHOT/
- I've started setting something up with these snapshots, so I can probably
start working on this.

Thanks!

Cheers,
-- 
Edgar Rodriguez


Iceberg in Spark 3.0.0

2019-08-07 Thread Edgar Rodriguez
Hi everyone,

I was wondering if there's a branch tracking the changes happening in Spark
3.0.0 for Iceberg. The DataSource V2 API has changed substantially from the
one implemented in the Iceberg master branch, and since Spark 3.0.0 would
allow us to introduce Spark SQL support, it seems worthwhile to start
tracking those changes and evaluating some of the support as it evolves.

Thanks.

Cheers,
-- 
Edgar Rodriguez


Re: [DISCUSS] Write-audit-publish support

2019-07-22 Thread Edgar Rodriguez
I think this use case is pretty helpful in most data environments; we do
the same sort of stage-check-publish pattern to run quality checks.
One question: if the audit part fails, is there a way to expire the
snapshot, or what would the workflow that follows look like?
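
For context, a rough sketch of how I picture the flow from the Spark side,
based on Ryan's description below (spark.wap.id and write.wap.enabled come
from that description; the table name, WAP ID, and the lookup of
stagedSnapshotId are illustrative):

    // Mark this job as a WAP write; the table has write.wap.enabled=true set.
    spark.conf().set("spark.wap.id", "job-1234");

    // Write as usual - the new snapshot is staged instead of becoming current.
    df.write().format("iceberg").mode("append").save("db.events");

    // Audit: read the staged snapshot directly via time travel on its ID
    // (stagedSnapshotId would be found via the WAP ID in the snapshot summary).
    Dataset<Row> staged = spark.read()
        .format("iceberg")
        .option("snapshot-id", Long.toString(stagedSnapshotId))
        .load("db.events");

    // On success, a separate process publishes the snapshot as the current
    // table state; on failure, the snapshot could be expired instead.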

Best,
Edgar

On Mon, Jul 22, 2019 at 9:59 AM Mouli Mukherjee 
wrote:

> This would be super helpful. We have a similar workflow where we do some
> validation before letting the downstream consume the changes.
>
> Best,
> Mouli
>
> On Mon, Jul 22, 2019 at 9:18 AM Filip  wrote:
>
>> This definitely sounds interesting. Quick question: does this impact the
>> current upserts spec? Or are we maybe looking to associate this support
>> with append-only commits?
>>
>> On Mon, Jul 22, 2019 at 6:51 PM Ryan Blue 
>> wrote:
>>
>>> Audits run on the snapshot by setting the snapshot-id read option to
>>> read the WAP snapshot, even though it has not (yet) been the current table
>>> state. This is documented in the time travel
>>> <http://iceberg.apache.org/spark/#time-travel> section of the Iceberg
>>> site.
>>>
>>> We added a stageOnly method to SnapshotProducer that adds the snapshot
>>> to table metadata, but does not make it the current table state. That is
>>> called by the Spark writer when there is a WAP ID, and that ID is embedded
>>> in the staged snapshot’s metadata so processes can find it.
>>>
>>> I'll add a PR with this code, since there is interest.
>>>
>>> rb
>>>
>>> On Mon, Jul 22, 2019 at 2:17 AM Anton Okolnychyi 
>>> wrote:
>>>
>>>> I would also support adding this to Iceberg itself. I think we have a
>>>> use case where we can leverage this.
>>>>
>>>> @Ryan, could you also provide more info on the audit process?
>>>>
>>>> Thanks,
>>>> Anton
>>>>
>>>> On 20 Jul 2019, at 04:01, RD  wrote:
>>>>
>>>> I think this could be useful. When we ingest data from Kafka, we do a
>>>> predefined set of checks on the data. We can potentially utilize something
>>>> like this to check for sanity before publishing.
>>>>
>>>> How is the auditing process supposed to find the new snapshot, since it
>>>> is not accessible from the table? Is it by convention?
>>>>
>>>> -R
>>>>
>>>> On Fri, Jul 19, 2019 at 2:01 PM Ryan Blue 
>>>> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> At Netflix, we have a pattern for building ETL jobs where we write
>>>>> data, then audit the result before publishing the data that was written to
>>>>> a final table. We call this WAP for write, audit, publish.
>>>>>
>>>>> We’ve added support in our Iceberg branch. A WAP write creates a new
>>>>> table snapshot, but doesn’t make that snapshot the current version of the
>>>>> table. Instead, a separate process audits the new snapshot and updates the
>>>>> table’s current snapshot when the audits succeed. I wasn’t sure that this
>>>>> would be useful anywhere else until we talked to another company this week
>>>>> that is interested in the same thing. So I wanted to check whether this is
>>>>> a good feature to include in Iceberg itself.
>>>>>
>>>>> This works by staging a snapshot. Basically, Spark writes data as
>>>>> expected, but Iceberg detects that it should not update the table’s
>>>>> current state. That happens when there is a Spark property, spark.wap.id,
>>>>> that indicates the job is a WAP job. Then any table that has WAP enabled 
>>>>> by
>>>>> the table property write.wap.enabled=true will stage the new snapshot
>>>>> instead of fully committing, with the WAP ID in the snapshot’s metadata.
>>>>>
>>>>> Is this something we should open a PR to add to Iceberg? It seems a
>>>>> little strange to make it appear that a commit has succeeded, but not
>>>>> actually change a table, which is why we didn’t submit it before now.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> rb
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Netflix
>>>>>
>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>
>> --
>> Filip Bocse
>>
>

-- 
Edgar Rodriguez