Re: Question about Iceberg release cadence

2020-08-27 Thread Saisai Shao
I would like to get the structured streaming reader into the next release :).
I will spend time on addressing the new feedback.

Thanks
Saisai

On Thu, Aug 27, 2020 at 10:36 PM, Mass Dosage wrote:

> I'm all for a release. The only thing still required for basic Hive read
> support (other than documentation of course!) is producing a *single* jar
> that can be added to Hive's classpath, the PR for that is at
> https://github.com/apache/iceberg/pull/1267.
>
> Thanks,
>
> Adrian
>
> On Thu, 27 Aug 2020 at 01:26, Anton Okolnychyi
>  wrote:
>
>> +1 on releasing structured streaming source. I should be able to do one
>> more review round tomorrow.
>>
>> - Anton
>>
>> On 26 Aug 2020, at 17:12, Jungtaek Lim 
>> wrote:
>>
>> I hope we include Spark structured streaming read as well in the next
>> release; that was proposed in Feb this year and still around. Quoting my
>> comment on benefit of the streaming read on Spark;
>>
>>> This would be the major feature to cover the gap on use case for
>>> structured streaming between Delta Lake and Iceberg. There's a technical
>>> limitation on Spark structured streaming itself (global watermark), which
>>> requires workaround via splitting query into multiple queries &
>>> intermediate storage supporting end-to-end exactly once. Delta Lake covers
>>> the case, and I really would like to see the case also covered by Iceberg.
>>> I see there're lots of works in progress on the milestone (and these are
>>> great features which should be done), but after this we cover both batch
>>> and streaming workloads being done with Spark, which is a huge step forward
>>> on Spark users.
>>
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> On Thu, Aug 27, 2020 at 1:13 AM Ryan Blue 
>> wrote:
>>
>>> Hi Marton,
>>>
>>> 0.9.0 was released about 6 weeks ago, so I don't think we've planned
>>> when the next release will be yet. I think it's a good idea to release
>>> soon, though. The Flink sink is close to being ready as well and I'd like
>>> to get both of those released so that the contributors can start using them.
>>>
>>> Seems like a good question for the broader community: how about a
>>> release in the next month or so for Hive reads and the Flink sink?
>>>
>>> rb
>>>
>>> On Wed, Aug 26, 2020 at 8:58 AM Marton Bod  wrote:
>>>
>>>> Hi Team,
>>>>
>>>> I was wondering whether there is a release cadence already in place for
>>>> Iceberg, e.g. how often releases will take place approximately? Which
>>>> commits/features are candidates for release in the near term?
>>>>
>>>> We're looking to integrate Iceberg into Hive, however, the current
>>>> 0.9.1 release does not yet contain the StorageHandler code in iceberg-mr.
>>>> Knowing the approximate release timelines would help greatly with our
>>>> integration planning.
>>>>
>>>> Of course, happy to get involved with ongoing dev/stability efforts to
>>>> help achieve a new release of this module.
>>>>
>>>> Thanks a lot,
>>>> Marton

>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>>


Re: Hive Iceberg writes

2020-08-27 Thread RD
Our stance has been similar at LinkedIn. Hive writes are not a priority for
us as we plan to move more and more of our workloads from Hive to Spark SQL.

-R

On Thu, Aug 27, 2020 at 10:18 AM Edgar Rodriguez
 wrote:

> Hi folks,
>
> We have not started to work on this either, but we've discussed
> internally whether to support Hive writes. Our first priority
> right now is getting Hive reads in production to have read compatibility
> with our existing Hive clients. We'd be interested in this, however, at
> Airbnb we're moving to Spark so writes in Hive most likely won't be at the
> top of our list.
>
> Thanks!
>
> Cheers,
>
> On Thu, Aug 27, 2020 at 12:53 AM Mass Dosage  wrote:
>
>> We're definitely interested in this too but haven't started work on it
>> yet. It has been discussed at our community syncs as something quite a few
>> people are interested in so if nobody else responds a good starting point
>> would probably be an early WIP PR that everyone can follow and contribute
>> to.
>>
>> Thanks,
>>
>> Adrian
>>
>> On Wed, 26 Aug 2020 at 17:35, Ryan Blue 
>> wrote:
>>
>>> I think Edgar and Adrian, who have been contributing support for ORC and
>>> Hive, are interested in this as well.
>>>
>>> On Wed, Aug 26, 2020 at 9:22 AM Peter Vary 
>>> wrote:
>>>
>>>> Hi Team,
>>>>
>>>> We are thinking about implementing HiveOutputFormat, so writes through
>>>> Hive can work as well.
>>>> Is anybody working on this? Do you know of any ongoing effort related to
>>>> Hive writes?
>>>> Asking because we would like to prevent duplicate effort.
>>>> Also, if anyone has some good pointers on where to start for an Iceberg
>>>> newbie, it would be good.
>>>>
>>>> Thanks,
>>>> Peter


>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
>
> --
> Edgar R
>


Re: Hive Iceberg writes

2020-08-27 Thread Edgar Rodriguez
Hi folks,

We have not started to work on this either, but we've discussed
internally whether to support Hive writes. Our first priority
right now is getting Hive reads in production to have read compatibility
with our existing Hive clients. We'd be interested in this, however, at
Airbnb we're moving to Spark so writes in Hive most likely won't be at the
top of our list.

Thanks!

Cheers,

On Thu, Aug 27, 2020 at 12:53 AM Mass Dosage  wrote:

> We're definitely interested in this too but haven't started work on it
> yet. It has been discussed at our community syncs as something quite a few
> people are interested in so if nobody else responds a good starting point
> would probably be an early WIP PR that everyone can follow and contribute
> to.
>
> Thanks,
>
> Adrian
>
> On Wed, 26 Aug 2020 at 17:35, Ryan Blue  wrote:
>
>> I think Edgar and Adrian, who have been contributing support for ORC and
>> Hive, are interested in this as well.
>>
>> On Wed, Aug 26, 2020 at 9:22 AM Peter Vary 
>> wrote:
>>
>>> Hi Team,
>>>
>>> We are thinking about implementing HiveOutputFormat, so writes through
>>> Hive can work as well.
>>> Is anybody working on this? Do you know of any ongoing effort related to
>>> Hive writes?
>>> Asking because we would like to prevent duplicate effort.
>>> Also, if anyone has some good pointers on where to start for an Iceberg
>>> newbie, it would be good.
>>>
>>> Thanks,
>>> Peter
>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

-- 
Edgar R


Re: Question about Iceberg release cadence

2020-08-27 Thread Mass Dosage
I'm all for a release. The only thing still required for basic Hive read
support (other than documentation of course!) is producing a *single* jar
that can be added to Hive's classpath, the PR for that is at
https://github.com/apache/iceberg/pull/1267.

Thanks,

Adrian

On Thu, 27 Aug 2020 at 01:26, Anton Okolnychyi
 wrote:

> +1 on releasing structured streaming source. I should be able to do one
> more review round tomorrow.
>
> - Anton
>
> On 26 Aug 2020, at 17:12, Jungtaek Lim 
> wrote:
>
> I hope we include Spark structured streaming read as well in the next
> release; that was proposed in Feb this year and still around. Quoting my
> comment on benefit of the streaming read on Spark;
>
>> This would be the major feature to cover the gap on use case for
>> structured streaming between Delta Lake and Iceberg. There's a technical
>> limitation on Spark structured streaming itself (global watermark), which
>> requires workaround via splitting query into multiple queries &
>> intermediate storage supporting end-to-end exactly once. Delta Lake covers
>> the case, and I really would like to see the case also covered by Iceberg.
>> I see there're lots of works in progress on the milestone (and these are
>> great features which should be done), but after this we cover both batch
>> and streaming workloads being done with Spark, which is a huge step forward
>> on Spark users.
>
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Thu, Aug 27, 2020 at 1:13 AM Ryan Blue 
> wrote:
>
>> Hi Marton,
>>
>> 0.9.0 was released about 6 weeks ago, so I don't think we've planned when
>> the next release will be yet. I think it's a good idea to release soon,
>> though. The Flink sink is close to being ready as well and I'd like to get
>> both of those released so that the contributors can start using them.
>>
>> Seems like a good question for the broader community: how about a release
>> in the next month or so for Hive reads and the Flink sink?
>>
>> rb
>>
>> On Wed, Aug 26, 2020 at 8:58 AM Marton Bod  wrote:
>>
>>> Hi Team,
>>>
>>> I was wondering whether there is a release cadence already in place for
>>> Iceberg, e.g. how often releases will take place approximately? Which
>>> commits/features are candidates for release in the near term?
>>>
>>> We're looking to integrate Iceberg into Hive, however, the current 0.9.1
>>> release does not yet contain the StorageHandler code in iceberg-mr. Knowing
>>> the approximate release timelines would help greatly with our integration
>>> planning.
>>>
>>> Of course, happy to get involved with ongoing dev/stability efforts to
>>> help achieve a new release of this module.
>>>
>>> Thanks a lot,
>>> Marton
>>>
>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>
>


Re: Question about logging

2020-08-27 Thread Peter Vary
Hi Ryan,

Thanks for the detailed answer, and the confirmation! I created the pull
request "Hive: Add logging for Hive related classes" (#1394):

https://github.com/apache/iceberg/pull/1394
 
One comment on the message below: DEBUG level can be useful when an issue is 
only reproducible on the customer side. They can turn on DEBUG logging and 
share the logs, since connecting with a debugger is not always desirable.
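
As a rough illustration of turning DEBUG output on at runtime, here is a 
minimal sketch using java.util.logging, where Level.FINE stands in for DEBUG. 
This is an assumption made to keep the example self-contained; the class name 
DebugToggle, the logger name, and the file path are all hypothetical and not 
taken from the Iceberg code.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class DebugToggle {
  // Hypothetical logger name, used only for this sketch.
  private static final Logger LOG = Logger.getLogger("example.iceberg.hive");

  // Switch the logger between INFO (normal operation) and FINE (DEBUG-like).
  static void setDebug(boolean enabled) {
    LOG.setLevel(enabled ? Level.FINE : Level.INFO);
  }

  static boolean debugEnabled() {
    return LOG.isLoggable(Level.FINE);
  }

  // DEBUG-level event: which file is opened and in which format.
  static void openFile(String path, String format) {
    // The guard avoids building the message when DEBUG is off.
    if (LOG.isLoggable(Level.FINE)) {
      LOG.fine("Opening file " + path + " with format " + format);
    }
  }

  public static void main(String[] args) {
    setDebug(false);
    openFile("/tmp/data-0001.parquet", "PARQUET"); // suppressed at INFO

    setDebug(true);
    // Passes the logger-level check now. Note that the default console
    // handler filters at INFO, so it must also be set to FINE to print.
    openFile("/tmp/data-0001.parquet", "PARQUET");
  }
}
```

With something like this, a customer only needs a configuration change to 
produce the extra log lines, and no debugger has to be attached.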

I have seen multiline log messages in Iceberg. These can make the logs hard 
to read when there are many concurrent users and the log lines start to 
interleave. I suggest collapsing them to a single line.
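
A minimal sketch of what such collapsing could look like is below. The helper 
SingleLineLog.collapse and the " | " separator are hypothetical choices for 
illustration, not existing Iceberg code.

```java
public class SingleLineLog {
  // Replace every line break, together with the whitespace around it,
  // by a visible separator so the whole message stays on one log line.
  static String collapse(String message) {
    if (message == null) {
      return "";
    }
    return message.trim().replaceAll("\\s*\\R\\s*", " | ");
  }

  public static void main(String[] args) {
    String multiline = "Scanning table db.tbl\n  snapshotId=42\n  filter=true";
    System.out.println(collapse(multiline));
  }
}
```

Each message then occupies exactly one line, so lines from concurrent users 
can interleave without splitting a single message apart.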

Thanks,
Peter


> On Aug 26, 2020, at 18:33, Ryan Blue  wrote:
> 
> Hi Peter,
> 
> Thanks for thinking about this! Improving logs is a great contribution.
> 
> My philosophy is to stick to the usual definitions of logging levels. Here’s 
> my quick summary:
> 
> FATAL: the event that caused the service to stop (not used in Iceberg, since 
> it’s a library)
> ERROR: an event that stops an operation
> WARNING: an unexpected event where the operation can continue but there may 
> be a bigger problem
> INFO: events that help you understand what’s happening in normal operation
> DEBUG: events that help you spot bugs by looking for unexpected values
> 
> In general, I think that we can do better with INFO logging. But there is a 
> balance to target because we don’t want Iceberg to be so verbose that people 
> turn the logging off. We have to remember that Iceberg is typically 
> integrated into some other framework that is more important. Iceberg should 
> log the messages that help someone understand what the engine is doing, 
> rather than logging internal concerns. A good example of what not to do comes 
> from Parquet, where INFO messages were printed with stats from each row group 
> — this was too much because Parquet’s contribution to the overall picture of 
> what the engine was doing was just that it was storing data at the file 
> level. Parquet internal concerns should not have leaked.
> 
> I also don’t typically like to use DEBUG logging because you almost never 
> keep those logs. For debugging, I prefer to use a debugger and keep the code 
> clean. I think you could argue that DEBUG is appropriate for cases like my 
> Parquet example above, where INFO isn’t a good idea for the final 
> application, but you may still want to log information about the operations 
> in a library. I think that’s fine, just not things like logging the value of 
> arguments to functions all over the codebase. I think the use of DEBUG for 
> files that are being opened sounds reasonable.
> 
> I think the additional logs you’re proposing mostly look good. The scan 
> configuration is logged in Iceberg’s BaseTableScan, so you should already get 
> snapshot ID and filters. I’m not sure that we keep the time. We also have a 
> system for emitting events with more detail. Those can be used for additional 
> logging if you choose. We send them through a Kafka pipeline so we can 
> analyze all the scans taking place in our data warehouse.
> 
> rb
> 
> 
> On Wed, Aug 26, 2020 at 6:24 AM Peter Vary  wrote:
> Hi Team,
> 
> I was wondering if there is a general best practice for using log levels in 
> Iceberg. What is the usual way we do it?
> 
> I have been playing around with Iceberg/Hive integration and was wondering 
> how I would be able to debug a customer case based on the logs.
> To be honest, the answer based on the current code is most probably to use 
> some additional method to generate new log lines, which I think is not a 
> good situation.
> 
> I was thinking about adding some more logging to the code and added the 
> proposed log level next to it:
> CatalogLoader/Catalogs:
>     Which catalog is used to fetch the table - INFO
>     What is the table being fetched in the end - INFO
> HiveCatalog:
>     Table is created/dropped/renamed - INFO
>     Catalog is created/dropped - INFO
> HiveTableOperation:
>     Write committed / table created - INFO
> IcebergInputFormat:
>     Scan parameters provided - DEBUG
>         SnapshotId
>         Time
>         Filters
>     Final snapshotId being read - INFO (if I find a way to get this)
>     Opening specific files and their format - DEBUG
> 
> Are the logs above compliant with the current practice? Shall I move 
> them up or down a level? Are there any places around Iceberg/Hive where you 
> think it would be good to add logging?
> 
> Thanks,
> Peter
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix



Re: Hive Iceberg writes

2020-08-27 Thread Mass Dosage
We're definitely interested in this too but haven't started work on it yet.
It has been discussed at our community syncs as something quite a few
people are interested in so if nobody else responds a good starting point
would probably be an early WIP PR that everyone can follow and contribute
to.

Thanks,

Adrian

On Wed, 26 Aug 2020 at 17:35, Ryan Blue  wrote:

> I think Edgar and Adrian, who have been contributing support for ORC and
> Hive, are interested in this as well.
>
> On Wed, Aug 26, 2020 at 9:22 AM Peter Vary 
> wrote:
>
>> Hi Team,
>>
>> We are thinking about implementing HiveOutputFormat, so writes through
>> Hive can work as well.
>> Is anybody working on this? Do you know of any ongoing effort related to
>> Hive writes?
>> Asking because we would like to prevent duplicate effort.
>> Also, if anyone has some good pointers on where to start for an Iceberg
>> newbie, it would be good.
>>
>> Thanks,
>> Peter
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>