Re: Append file and update properties in a single transaction

2020-12-07 Thread Omar Aloraini
Wow, Thanks a lot. I will try it tomorrow. 

On Mon, Dec 7, 2020, 9:25 PM Ryan Blue  wrote:

> Transactions do support property updates:
> https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseTransaction.java#L106-L111
>
> The commit check is to ensure that the last operation was complete before
> adding a new operation. You need to call commit on each operation created
> from the transaction or transaction table, but it doesn't modify the table
> until the entire transaction is committed.
>
> On Mon, Dec 7, 2020 at 10:18 AM Omar Aloraini 
> wrote:
>
>> Hi Ryan, thanks for the reply,
>>
>> I can't recall the class name (will update you once I check), I think it
>> was TransactionalTable; every method that modifies the table checks
>> whether the last part of a transaction (belonging to the same object
>> created by table.newTransaction) was committed, e.g. newAppend,
>> updateProperties and so on.
>>
>> I'm not familiar with how Iceberg performs a transaction at a low level,
>> be it through an HDFS rename or something else. I'll look into it tomorrow, and if it's
>> not too difficult, I would like to work on it.
>>
>> I will update you with details once I am at work tomorrow.
>>
>> Regards
>>
>> On Mon, Dec 7, 2020, 9:05 PM Ryan Blue  wrote:
>>
>>> Omar,
>>>
>>> You can append files and update properties. You just need to create a
>>> transaction using `newTransaction` in the `Table` API.
>>>
>>> rb
>>>
>>> On Sun, Dec 6, 2020 at 7:16 AM Omar Aloraini 
>>> wrote:
>>>
 Hello everyone,

 I'm trying to append new files and update the table properties in a
 single transaction, but it seems from the source code that I can't. Is
 there a workaround for this?

 Regards

>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Append file and update properties in a single transaction

2020-12-07 Thread Ryan Blue
Transactions do support property updates:
https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseTransaction.java#L106-L111

The commit check is to ensure that the last operation was complete before
adding a new operation. You need to call commit on each operation created
from the transaction or transaction table, but it doesn't modify the table
until the entire transaction is committed.
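Put together, the flow described above might look like the following sketch (a fragment, not a complete program; it assumes a `Table` named `table` and a `DataFile` named `dataFile` already exist, and omits error handling):

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;
import org.apache.iceberg.Transaction;

// Sketch: append a file and update properties in one atomic commit.
Transaction txn = table.newTransaction();

// Each operation is committed to the transaction, not to the table.
txn.newAppend()
    .appendFile(dataFile)
    .commit();

txn.updateProperties()
    .set("write.format.default", "parquet")
    .commit();

// Nothing is visible to readers until the whole transaction commits.
txn.commitTransaction();
```

The per-operation `commit()` calls only seal each pending update inside the transaction; readers see both changes together only after `commitTransaction()`.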

On Mon, Dec 7, 2020 at 10:18 AM Omar Aloraini 
wrote:

> Hi Ryan, thanks for the reply,
>
> I can't recall the class name (will update you once I check), I think it
> was TransactionalTable; every method that modifies the table checks
> whether the last part of a transaction (belonging to the same object
> created by table.newTransaction) was committed, e.g. newAppend,
> updateProperties and so on.
>
> I'm not familiar with how Iceberg performs a transaction at a low level,
> be it through an HDFS rename or something else. I'll look into it tomorrow, and if it's
> not too difficult, I would like to work on it.
>
> I will update you with details once I am at work tomorrow.
>
> Regards
>
> On Mon, Dec 7, 2020, 9:05 PM Ryan Blue  wrote:
>
>> Omar,
>>
>> You can append files and update properties. You just need to create a
>> transaction using `newTransaction` in the `Table` API.
>>
>> rb
>>
>> On Sun, Dec 6, 2020 at 7:16 AM Omar Aloraini 
>> wrote:
>>
>>> Hello everyone,
>>>
>>> I'm trying to append new files and update the table properties in a single
>>> transaction, but it seems from the source code that I can't. Is there a
>>> workaround for this?
>>>
>>> Regards
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: Append file and update properties in a single transaction

2020-12-07 Thread Omar Aloraini
Hi Ryan, thanks for the reply,

I can't recall the class name (will update you once I check), I think it
was TransactionalTable; every method that modifies the table checks
whether the last part of a transaction (belonging to the same object
created by table.newTransaction) was committed, e.g. newAppend,
updateProperties and so on.

I'm not familiar with how Iceberg performs a transaction at a low level, be
it through an HDFS rename or something else. I'll look into it tomorrow, and if it's not
too difficult, I would like to work on it.

I will update you with details once I am at work tomorrow.

Regards

On Mon, Dec 7, 2020, 9:05 PM Ryan Blue  wrote:

> Omar,
>
> You can append files and update properties. You just need to create a
> transaction using `newTransaction` in the `Table` API.
>
> rb
>
> On Sun, Dec 6, 2020 at 7:16 AM Omar Aloraini 
> wrote:
>
>> Hello everyone,
>>
>> I'm trying to append new files and update the table properties in a single
>> transaction, but it seems from the source code that I can't. Is there a
>> workaround for this?
>>
>> Regards
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Sync notes for 2 December 2020

2020-12-07 Thread Ryan Blue
Thanks for the note, Peter. No problem if it doesn't get done, that was
just one that I thought we may want to include. Since we were unable to ask
anyone actually working on it, we knew it was aspirational. We also know it
could be a stretch to get the release out this year. Have a great
holiday!

On Mon, Dec 7, 2020 at 8:11 AM Peter Vary  wrote:

> Hi Team,
>
> I will be OOO starting from the end of this week. If someone wants to pick
> up Hive config changes, feel free to grab it. I am afraid I will not have
> enough time to do it this year.
>
> Thanks,
> Peter
>
> On Dec 5, 2020, at 02:13, Ryan Blue  wrote:
>
> Hi everyone,
>
> I just wrote up my notes for the sync on Wednesday. Feel free to comment
> if I’ve missed anything or didn’t remember clearly.
>
> Here’s a quick summary of the main points:
>
>- Highlights: Spark stored procedures are available, Glue and Nessie
>catalogs are committed, and Hive now supports DDL and projection pushdown!
>Thanks to Anton, Jack, Ryan M., Peter, Christine, Marton, Adrian, and
>Adrien!
>- We’re going to aim for an 0.11.0 release in the next couple weeks to
>avoid waiting until January
>   - It would be nice to get the Hive config changes in if possible
>   - Will try to get MERGE INTO done in Spark (DELETE FROM was merged
>   today!)
>   - Will also aim to get in Flink CDC writes, as well as Flink
>   streaming reads
>   - Please speak up if you need something specific!
>- Support for relative paths is a good idea, will probably add them to
>the spec
>
> The full notes are here:
> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.r951w2dduwmy
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>

-- 
Ryan Blue
Software Engineer
Netflix


Re: Append file and update properties in a single transaction

2020-12-07 Thread Ryan Blue
Omar,

You can append files and update properties. You just need to create a
transaction using `newTransaction` in the `Table` API.

rb

On Sun, Dec 6, 2020 at 7:16 AM Omar Aloraini 
wrote:

> Hello everyone,
>
> I'm trying to append new files and update the table properties in a single
> transaction, but it seems from the source code that I can't. Is there a
> workaround for this?
>
> Regards
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: Shall we start a regular community sync up?

2020-12-07 Thread Guy Khazma
Hi,

I will be glad to get an invitation as well.
Thanks!

On 2020/12/01 19:23:14, Russell Spitzer  wrote: 
> Invite sent
> 
> On Tue, Dec 1, 2020 at 1:20 PM Wing Yew Poon 
> wrote:
> 
> > I'd like to attend the community syncs as well. Can you please send me an
> > invite?
> > Thanks,
> > Wing Yew Poon
> >
> > On Thu, Nov 19, 2020 at 9:25 PM Chitresh Kakwani <
> > chitreshkakw...@gmail.com> wrote:
> >
> >> Hi Ryan,
> >>
> >> Could you please add me to the invitation list as well ? New entrant.
> >> Interested in Iceberg's roadmap.
> >>
> >> Regards,
> >> Chitresh Kakwani
> >>
> >> On Thu, Nov 19, 2020 at 6:21 PM Vivekanand Vellanki 
> >> wrote:
> >>
> >>> Hi Ryan,
> >>>
> >>> I'd like to attend the regular community syncs. Can you send me an
> >>> invite?
> >>>
> >>> Thanks
> >>> Vivek
> >>>
> >>> On Mon, Jun 15, 2020 at 11:16 PM Edgar Rodriguez
> >>>  wrote:
> >>>
>  Hi Ryan,
> 
>  I'd like to attend the regular community syncs, could you send me
>  an invite?
> 
>  Thanks!
> 
>  - Edgar
> 
>  On Wed, Mar 25, 2020 at 6:38 PM Ryan Blue 
>  wrote:
> 
> > Will do.
> >
> > On Wed, Mar 25, 2020 at 6:36 PM Jun Ma  wrote:
> >
> >> Hi Ryan,
> >>
> >> Thanks for driving the sync up meeting. Could you please add Fan Diao(
> >> fan.dia...@gmail.com) and myself to the invitation list?
> >>
> >> Thanks,
> >> Jun Ma
> >>
> >> On Mon, Mar 23, 2020 at 9:57 PM OpenInx  wrote:
> >>
> >>> Hi Ryan
> >>>
> >>> I received your invitation. Some guys from our Flink teams also want
> >>> to join the hangouts meeting. Do we also need
> >>> to send an extra invitation to them? Or can they just join the
> >>> meeting by entering the meeting address[1]?
> >>>
> >>> If need so, please let the following guys in:
> >>> 1. ykt...@gmail.com
> >>> 2. imj...@gmail.com
> >>> 3. yuzhao@gmail.com
> >>>
> >>> BTW, I've written a draft to discuss in the meeting [2]; anyone
> >>> can add the topics they want to discuss.
> >>>
> >>> Thanks.
> >>>
> >>> [1]. https://meet.google.com/_meet/xdx-rknm-uvm
> >>> [2].
> >>> https://docs.google.com/document/d/1wXTHGYhc7sDhP5DxlByba0S5YguNLWwY98FAp6Tx2mw/edit#
> >>>
> >>> On Mon, Mar 23, 2020 at 5:35 AM Ryan Blue 
> >>> wrote:
> >>>
>  I invited everyone that replied to this thread and the people that
>  were on the last invite.
> 
>  If you have specific topics you'd like to put on the agenda, please
>  send them to me!
> 
>  On Sun, Mar 22, 2020 at 2:28 PM Ryan Blue 
>  wrote:
> 
> > Let's go with Wednesday. I'll send out an invite.
> >
> > On Sun, Mar 22, 2020 at 1:36 PM John Zhuge 
> > wrote:
> >
> >> 5-5:30 pm work for me. Prefer Wednesdays.
> >>
> >> On Sun, Mar 22, 2020 at 1:33 PM Romin Parekh <
> >> rominpar...@gmail.com> wrote:
> >>
> >>> Hi folks,
> >>>
> >>> Both times slots work for me next week. Can we confirm a day?
> >>>
> >>> Thanks,
> >>> Romin
> >>>
> >>> Sent from my iPhone
> >>>
> >>> > On Mar 20, 2020, at 11:38 PM, Jun H. 
> >>> wrote:
> >>> >
> >>> > The schedule works for me.
> >>> >
> >>> >> On Thu, Mar 19, 2020 at 6:55 PM Junjie Chen <
> >>> chenjunjied...@gmail.com> wrote:
> >>> >>
> >>> >> The same time works for me as well.
> >>> >>
> >>> >>> On Fri, Mar 20, 2020 at 9:43 AM Gautam <
> >>> gautamkows...@gmail.com> wrote:
> >>> >>>
> >>> >>> 5 / 5:30pm any day of next week works for me.
> >>> >>>
> >>> >>> On Thu, Mar 19, 2020 at 6:07 PM 李响 
> >>> wrote:
> >>> 
> >>>  5 or 5:30 PM (UTC-7; is it PDT now?) on any day works for
> >>> me. Looking forward to it 8-)
> >>> 
> >>>  On Fri, Mar 20, 2020 at 8:17 AM RD 
> >>> wrote:
> >>> >
> >>> > Same time works for me too!
> >>> >
> >>> > On Thu, Mar 19, 2020 at 4:45 PM Xabriel Collazo Mojica
> >>>  wrote:
> >>> >>
> >>> >> 5pm or 5:30pm PT  any day next week would work for me.
> >>> >>
> >>> >> Thanks for restoring the community sync up!
> >>> >>
> >>> >> Xabriel J Collazo Mojica  |  Sr Computer Scientist II  |
> >>> Adobe
> >>> >>
> >>> >> On 3/18/20, 6:45 PM, "justin_cof...@apple.com on behalf
> >>> of Justin Q Coffey"  >>> j...@apple.com.INVALID> wrote:
> >>> >>
> >>> >>Any chance we could actually do 5:30pm PST?  I'm a bit
> >>> of a lurker, but this roadmap is important to mine and I have a 
> 

Re: Iceberg/Hive properties handling

2020-12-07 Thread Jacques Nadeau
Hey Peter, thanks for updating the doc and your heads up in the other
thread on your capacity to look at this before EOY.

I'm going to try to create a specification document based on the discussion
document you put together. I think there is general consensus around what
you call "Spark-like catalog configuration" so I'd like to formalize that
more.

It seems like there is less consensus around the whitelist/blacklist side
of things. You outline four approaches:

   1. Hard coded HMS only property list
   2. Hard coded Iceberg only property list
   3. Prefix for Iceberg properties
   4. Prefix for HMS only properties

I generally think #2 is a no-go as it creates too much coupling between
catalog implementations and core Iceberg. It seems like Ryan Blue would
prefer #4 (correct?). Any other strong opinions?
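To make the trade-off concrete, here is a minimal, hypothetical sketch of the prefix idea (#3/#4): properties carrying an agreed prefix are routed to one store and everything else to the other. The prefix string, helper class, and method names are all invented for illustration; nothing here reflects a decided design.

```java
import java.util.HashMap;
import java.util.Map;

public class PropertyRouter {
    // Hypothetical prefix marking HMS-only properties (approach #4).
    static final String HMS_ONLY_PREFIX = "hms.";

    // Split a raw property map: prefixed keys stay HMS-only (prefix stripped
    // here for illustration), everything else goes to the Iceberg table.
    static Map<String, Map<String, String>> route(Map<String, String> props) {
        Map<String, String> hmsOnly = new HashMap<>();
        Map<String, String> iceberg = new HashMap<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            if (e.getKey().startsWith(HMS_ONLY_PREFIX)) {
                hmsOnly.put(e.getKey().substring(HMS_ONLY_PREFIX.length()), e.getValue());
            } else {
                iceberg.put(e.getKey(), e.getValue());
            }
        }
        Map<String, Map<String, String>> out = new HashMap<>();
        out.put("hms", hmsOnly);
        out.put("iceberg", iceberg);
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("hms.external.table.purge", "true");
        props.put("write.format.default", "orc");
        Map<String, Map<String, String>> routed = route(props);
        System.out.println(routed.get("hms"));      // {external.table.purge=true}
        System.out.println(routed.get("iceberg"));  // {write.format.default=orc}
    }
}
```

Approach #3 would be the mirror image: prefixed keys go to the Iceberg table and unprefixed ones stay in HMS.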
--
Jacques Nadeau
CTO and Co-Founder, Dremio


On Thu, Dec 3, 2020 at 9:27 AM Peter Vary 
wrote:

> As Jacques suggested (with the help of Zoltan) I have collected the
> current state and the proposed solutions in a document:
>
> https://docs.google.com/document/d/1KumHM9IKbQyleBEUHZDbeoMjd7n6feUPJ5zK8NQb-Qw/edit?usp=sharing
>
> My feeling is that we do not have a final decision, so tried to list all
> the possible solutions.
> Please comment!
>
> Thanks,
> Peter
>
> On Dec 2, 2020, at 18:10, Peter Vary  wrote:
>
> When I was working on the CREATE TABLE patch I found the following
> TBLPROPERTIES on newly created tables:
>
>- external.table.purge
>- EXTERNAL
>- bucketing_version
>- numRows
>- rawDataSize
>- totalSize
>- numFiles
>- numFileErasureCoded
>
>
> I am afraid that we cannot change the names of most of these properties,
> and it might not be useful to keep most of them when Iceberg statistics are
> already there. Also my feeling is that this is only the tip of the Iceberg
> (pun intended :)), so this is why I think we should have a more targeted way
> to push properties to Iceberg tables.
>
> On Dec 2, 2020, at 18:04, Ryan Blue  wrote:
>
> Sorry, I accidentally didn’t copy the dev list on this reply. Resending:
>
> Also I expect that we want to add Hive write specific configs to table
> level when the general engine independent configuration is not ideal for
> Hive, but every Hive query for a given table should use some specific
> config.
>
> Hive may need configuration, but I think these should still be kept in the
> Iceberg table. There is no reason to make Hive config inaccessible from
> other engines. If someone wants to view all of the config for a table from
> Spark, the Hive config should also be included right?
>
> On Tue, Dec 1, 2020 at 10:36 AM Peter Vary  wrote:
>
>> I will ask Laszlo if he wants to update his doc.
>>
>> I see both pros and cons of catalog definition in config files. If there
>> is an easy default then I do not mind any of the proposed solutions.
>>
>> OTOH I am in favor of the "use prefix for Iceberg table properties"
>> solution, because in Hive it is common to add new keys to the property list
>> - no restriction is in place (I am not even sure that the currently
>> implemented blacklist for preventing properties from propagating to Iceberg
>> tables is complete). Also I expect that we want to add Hive write specific
>> configs to table level when the general engine independent configuration is
>> not ideal for Hive, but every Hive query for a given table should use some
>> specific config.
>>
>> Thanks, Peter
>>
>> Jacques Nadeau  ezt írta (időpont: 2020. dec. 1., Ke
>> 17:06):
>>
>>> Would someone be willing to create a document that states the current
>>> proposal?
>>>
>>> It is becoming somewhat difficult to follow this thread. I also worry
>>> that without a complete statement of the current shape that people may be
>>> incorrectly thinking they are in alignment.
>>>
>>>
>>>
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>>
>>>
>>> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <
>>> borokna...@cloudera.com> wrote:
>>>
 Thanks, Ryan. I answered inline.

 On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue  wrote:

> This sounds like a good plan overall, but I have a couple of notes:
>
>1. We need to keep in mind that users plug in their own catalogs,
>so iceberg.catalog could be a Glue or Nessie catalog, not just
>Hive or Hadoop. I don’t think it makes much sense to use separate
>hadoop.catalog and hive.catalog values. Those should just be names for
>catalogs configured in Configuration, i.e., via hive-site.xml. We
>then only need a special value for loading Hadoop tables from paths.
>
> About extensibility, I think the usual Hive way is to use Java class
 names. So this way the value for 'iceberg.catalog' could be e.g.
 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
 would need to have a factory method that constructs the catalog object from
 a properties object (Map). E.g.
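 The concrete example was cut off in the archive, but the factory-method pattern being described can be sketched generically with stdlib reflection. The interface and class names below are invented stand-ins for illustration; Iceberg's actual catalog-loading utilities may differ:

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for a catalog interface; real code would use Iceberg's Catalog.
interface DemoCatalog {
    String name();
}

// A catalog implementation discoverable by class name, initialized from properties.
class DemoHiveCatalog implements DemoCatalog {
    private String name = "uninitialized";

    public void initialize(Map<String, String> properties) {
        this.name = properties.getOrDefault("name", "default");
    }

    @Override
    public String name() { return name; }
}

public class CatalogLoader {
    // Instantiate a catalog from its class name, then call initialize(Map).
    static DemoCatalog load(String impl, Map<String, String> properties) throws Exception {
        Class<?> clazz = Class.forName(impl);
        Object catalog = clazz.getDeclaredConstructor().newInstance();
        Method init = clazz.getMethod("initialize", Map.class);
        init.invoke(catalog, properties);
        return (DemoCatalog) catalog;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> props = new HashMap<>();
        props.put("name", "prod_hive");
        DemoCatalog catalog = load("DemoHiveCatalog", props);
        System.out.println(catalog.name()); // prod_hive
    }
}
```

This keeps the engine-side configuration down to a single class-name string plus a property map, which is the extensibility point the thread is discussing.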
 

Re: Sync notes for 2 December 2020

2020-12-07 Thread Peter Vary
Hi Team,

I will be OOO starting from the end of this week. If someone wants to pick up 
Hive config changes, feel free to grab it. I am afraid I will not have enough 
time to do it this year.

Thanks,
Peter

> On Dec 5, 2020, at 02:13, Ryan Blue  wrote:
> 
> Hi everyone,
> 
> I just wrote up my notes for the sync on Wednesday. Feel free to comment if 
> I’ve missed anything or didn’t remember clearly.
> 
> Here’s a quick summary of the main points:
> 
> - Highlights: Spark stored procedures are available, Glue and Nessie catalogs
> are committed, and Hive now supports DDL and projection pushdown! Thanks to
> Anton, Jack, Ryan M., Peter, Christine, Marton, Adrian, and Adrien!
> - We’re going to aim for an 0.11.0 release in the next couple weeks to avoid
> waiting until January
>   - It would be nice to get the Hive config changes in if possible
>   - Will try to get MERGE INTO done in Spark (DELETE FROM was merged today!)
>   - Will also aim to get in Flink CDC writes, as well as Flink streaming reads
>   - Please speak up if you need something specific!
> - Support for relative paths is a good idea, will probably add them to the spec
> The full notes are here: 
> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit#heading=h.r951w2dduwmy
>  
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix