Query about the semantics of "overwrite" in Iceberg

2019-11-21 Thread Saisai Shao
Hi Team,

I found that Iceberg's "overwrite" behaves differently from Spark's built-in
sources like Parquet. Iceberg's "overwrite" semantics look more like
"upsert": partitions present in the new data are replaced, but partitions
the new data doesn't contain are left in place rather than deleted.

I would like to know the purpose of this design choice. Also, if I want
Spark Parquet's "overwrite" semantics, how would I achieve that?

Warning

*Spark does not define the behavior of DataFrame overwrite*. Like most
sources, Iceberg will dynamically overwrite partitions for which the
dataframe contains rows. Unpartitioned tables are completely overwritten.
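
For illustration, a minimal sketch of that dynamic overwrite through the
Spark DataFrame API (the "updates" dataframe and the table location are
hypothetical):

    // Iceberg's "overwrite" replaces only the partitions that appear in
    // the incoming dataframe; partitions absent from "updates" are left
    // untouched. Unpartitioned tables are replaced entirely.
    updates.write
      .format("iceberg")
      .mode("overwrite")
      .save("hdfs://nn/warehouse/db/table")   // hypothetical location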

Best regards,
Saisai


Re: Iceberg in Spark 3.0.0

2019-11-21 Thread Saisai Shao
Hi Ryan and team,

Thanks a lot for your response. I was wondering how we should share our
branch. One possible way is to maintain a forked Iceberg repo with a Spark
3.0.0-preview branch; another is to create a branch in the upstream Iceberg
repo. I'm inclined toward the second way, so that the community can review
and contribute to it.

I would like to hear your suggestions.

Best regards,
Saisai


On Wed, Nov 20, 2019 at 1:27 AM Ryan Blue wrote:

> Sounds great, thanks Saisai!
>
> On Mon, Nov 18, 2019 at 3:29 AM Saisai Shao wrote:
>
>> Thanks Anton, I will share our branch soon.
>>
>> Best regards,
>> Saisai
>>
>> On Mon, Nov 18, 2019 at 6:54 PM Anton Okolnychyi wrote:
>>
>>> I think it would be great if you can share what you have, Saisai. That
>>> way, we can all collaborate and ensure we build a full 3.0 integration as
>>> soon as possible.
>>>
>>> - Anton
>>>
>>>
>>> On 18 Nov 2019, at 02:08, Saisai Shao  wrote:
>>>
>>> Hi Anton,
>>>
>>> Thanks for bringing this up. We already have a branch building against
>>> Spark 3.0 (the master branch, actually) internally, and we're actively
>>> working on it. I think it is a good idea to create an upstream Spark 3.0
>>> branch; we could share ours if the community would like to do so.
>>>
>>> Best regards,
>>> Saisai
>>>
>>> On Mon, Nov 18, 2019 at 1:40 AM Anton Okolnychyi wrote:
>>>
 I think it is a good time to create a branch for our 3.0
 integration, now that the 3.0 preview has been released.
 What does everyone think? Has anybody started already?

 - Anton

 On 8 Aug 2019, at 23:47, Edgar Rodriguez wrote:



 On Thu, Aug 8, 2019 at 3:37 PM Ryan Blue  wrote:

> I think it's a great idea to branch and get ready for Spark 3.0.0.
> Right now, I'm focused on getting a release out, but I can review patches
> for Spark 3.0.
>
> Anyone know if there are nightly builds of Spark 3.0 to test with?
>

 Seems like there are nightly snapshots built at
 https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-sql_2.12/3.0.0-SNAPSHOT/
 I've started setting something up with these snapshots, so I can probably
 start working on this.

 Thanks!

 Cheers,
 --
 Edgar Rodriguez
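
A minimal sbt sketch for pulling those nightly snapshots into a build (the
artifact and version actually published can drift, so verify against the
repository listing above):

    // build.sbt (sketch): resolve Spark 3.0 nightly snapshots from the
    // Apache snapshots repository linked above.
    resolvers += "Apache Snapshots" at
      "https://repository.apache.org/content/repositories/snapshots/"

    // The version string is an assumption; check what the repo publishes.
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0-SNAPSHOT"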



>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: Queries on iceberg

2019-11-21 Thread Ryan Blue
Hi Sandeep, thanks for your interest in Iceberg.

> Does Iceberg support external Hive tables? If yes, does it support all
file systems like HDFS, S3, WASB/ADFS?

I'm not quite sure what you mean, because Iceberg replaces Hive tables and
is not compatible with them. It sounds like you might be asking about how
files are accessed and managed over their life-cycle.

Iceberg uses Hadoop FileSystem to read and write files, so as long as you
have a configured FileSystem, you can use any of those paths. Iceberg
doesn't have a concept of "external" data. Iceberg expects to manage all of
the files underneath it and will delete files as you remove snapshots that
track deleted files (logical deletes happen first, physical deletes later).
You can avoid the deletes by passing in a callback to manage removal
yourself. When you drop a table, there's a flag for whether the data files
should be removed. We wanted to make sure there is flexibility here for
platform teams to be able to clean up data as they need to. For example,
our users never delete data; we use Janitor services to clean up old
partitions, snapshots, and dangling files.
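
As an illustration of that callback, here is a hedged sketch using the
ExpireSnapshots API (the table variable and the retention window are
hypothetical):

    import scala.collection.mutable

    // Sketch: collect data file paths instead of letting Iceberg delete
    // them, so a separate janitor service can remove them later.
    val toDelete = mutable.Buffer[String]()

    table.expireSnapshots()                    // an org.apache.iceberg.Table
      .expireOlderThan(System.currentTimeMillis() - 7L * 24 * 60 * 60 * 1000)
      .deleteWith(path => toDelete += path)    // callback instead of deleting
      .commit()

    // hand toDelete off to your own cleanup process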

> Can we migrate externally created Hive tables to Iceberg tables without
deleting our existing data on S3?

There is an import utility you can use to create Iceberg metadata around
files in an existing Hive table:
https://github.com/apache/incubator-iceberg/blob/master/spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala#L479-L483
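
A hedged sketch of calling that utility (names and paths are hypothetical;
check the exact method signature in the linked file for the version you run):

    import org.apache.iceberg.hadoop.HadoopTables
    import org.apache.iceberg.spark.SparkTableUtil
    import org.apache.spark.sql.catalyst.TableIdentifier

    // Load the target Iceberg table (created ahead of time with a
    // matching schema and partition spec).
    val tables = new HadoopTables(spark.sessionState.newHadoopConf())
    val target = tables.load("s3://bucket/warehouse/db/events")

    // Builds Iceberg metadata that references the Hive table's existing
    // data files in place; nothing on S3 is rewritten or deleted.
    SparkTableUtil.importSparkTable(
      spark,
      TableIdentifier("events", Some("hive_db")),  // source Hive table
      target,
      "/tmp/iceberg-import-staging")               // staging dir for metadata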

> Does Iceberg have CLI support for DDL and DML queries?

Iceberg doesn't directly provide SQL support. For that, we have integration
with Spark and Presto.

Presto supports create, CTAS, drop, rename, and reading from Iceberg
tables. It also supports adding, dropping, and renaming columns.

Spark 2.4 supports only the DataFrame API; SQL support is coming in 3.0.
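
For example, reading an Iceberg table through the Spark 2.4 DataFrame API
looks roughly like this (the table location is hypothetical):

    import spark.implicits._

    // Read an Iceberg table by location and query it as a DataFrame.
    val df = spark.read.format("iceberg").load("hdfs://nn/warehouse/db/events")
    df.filter($"event_date" === "2019-11-20").show()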

> If Iceberg supports external tables, does it also support the ORC file format?

Iceberg doesn't support ORC yet, but there is a pull request for it that is
getting really close to merging. I'm currently trying to make time to
review it.

On Wed, Nov 20, 2019 at 1:56 AM Sandeep Kumar  wrote:

> All,
>
> I am new to Iceberg and want to explore it to optimise Hive query
> response time.
> I have a couple of questions:
> 1) Does Iceberg support external Hive tables? If yes, does it support all
> file systems like HDFS, S3, WASB/ADFS?
> 2) Can we migrate externally created Hive tables to Iceberg tables without
> deleting our existing data on S3?
> 3) Does Iceberg have CLI support for DDL and DML queries?
> 4) If Iceberg supports external tables, does it also support the ORC file
> format?
>
>
> Are there any code snippet examples for migrating Hive tables to Iceberg?
>
> Your help is much appreciated.
>
> Best Regards,
> Sandeep
>


-- 
Ryan Blue
Software Engineer
Netflix