I will ask Laszlo if he wants to update his doc.

I see both pros and cons of defining catalogs in config files. If there is
an easy default, then I do not mind any of the proposed solutions.

OTOH I am in favor of the "use a prefix for Iceberg table properties"
solution, because in Hive it is common to add new keys to the property list
- there is no restriction in place (I am not even sure that the currently
implemented blacklist preventing properties from propagating to Iceberg
tables is complete). Also, I expect that we will want to add Hive-specific
write configs at the table level for cases where the general,
engine-independent configuration is not ideal for Hive, but every Hive
query against a given table should use some specific config.

Thanks, Peter

On Tue, Dec 1, 2020 at 17:06, Jacques Nadeau <jacq...@dremio.com> wrote:

> Would someone be willing to create a document that states the current
> proposal?
>
> It is becoming somewhat difficult to follow this thread. I also worry that
> without a complete statement of the current shape, people may be
> incorrectly thinking they are in alignment.
>
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
>
> On Tue, Dec 1, 2020 at 5:32 AM Zoltán Borók-Nagy <borokna...@cloudera.com>
> wrote:
>
>> Thanks, Ryan. I answered inline.
>>
>> On Mon, Nov 30, 2020 at 8:26 PM Ryan Blue <rb...@netflix.com> wrote:
>>
>>> This sounds like a good plan overall, but I have a couple of notes:
>>>
>>>    1. We need to keep in mind that users plug in their own catalogs, so
>>>    iceberg.catalog could be a Glue or Nessie catalog, not just Hive or
>>>    Hadoop. I don’t think it makes much sense to use separate hadoop.catalog
>>>    and hive.catalog values. Those should just be names for catalogs
>>>    configured in Configuration, i.e., via hive-site.xml. We then only need
>>>    a special value for loading Hadoop tables from paths.
>>>
>> About extensibility, I think the usual Hive way is to use Java class
>> names. So this way the value for 'iceberg.catalog' could be e.g.
>> 'org.apache.iceberg.hive.HiveCatalog'. Then each catalog implementation
>> would need to have a factory method that constructs the catalog object from
>> a properties object (Map<String, String>). E.g.
>> 'org.apache.iceberg.hadoop.HadoopCatalog' would require
>> 'iceberg.catalog_location' to be present in properties.
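>>
>> A minimal sketch of what I have in mind (a hypothetical factory contract,
>> not an existing Iceberg API; all names here are illustrative, and
>> exception handling is omitted):
>>
>> Configuration conf = new Configuration();
>> Map<String, String> props = hmsTableProperties(); // hypothetical accessor
>> // e.g. props.get("iceberg.catalog") == "org.apache.iceberg.hadoop.HadoopCatalog"
>> Class<?> clazz = Class.forName(props.get("iceberg.catalog"));
>> // each implementation would expose something like:
>> //   static Catalog fromProperties(Configuration conf, Map<String, String> props)
>> // and e.g. HadoopCatalog.fromProperties would fail fast if
>> // 'iceberg.catalog_location' is missing from props
>> Catalog catalog = (Catalog) clazz
>>     .getMethod("fromProperties", Configuration.class, Map.class)
>>     .invoke(null, conf, props);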
>>
>>>
>>>    2. I don’t think that catalog configuration should be kept in table
>>>    properties. A catalog should not be loaded for each table. So I don’t
>>>    think we need iceberg.catalog_location. Instead, we should have a way
>>>    to define catalogs in the Configuration for tables in the metastore
>>>    to reference.
>>>
>> I think it makes sense; on the other hand, it would make adding new
>> catalogs more heavy-weight, i.e. now you'd need to edit configuration files
>> and restart/reinit services. Maybe it can be cumbersome in some
>> environments.
>>
>>>
>>>    3. I’d rather use a prefix to exclude properties from being passed
>>>    to Iceberg than to include them. Otherwise, users don’t know what to
>>>    do to pass table properties from Hive or Impala. If we exclude a
>>>    prefix or specific properties, then everything but the properties
>>>    reserved for locating the table is passed as the user would expect.
>>>
>> I don't have a strong opinion about this, but yeah, maybe this behavior
>> would cause the least surprises.
>>
>>>
>>>
>>>
>>> On Mon, Nov 30, 2020 at 7:51 AM Zoltán Borók-Nagy <borokna...@apache.org>
>>> wrote:
>>>
>>>> Thanks, Peter. I answered inline.
>>>>
>>>> On Mon, Nov 30, 2020 at 3:13 PM Peter Vary <pv...@cloudera.com.invalid>
>>>> wrote:
>>>>
>>>>> Hi Zoltan,
>>>>>
>>>>> Answers below:
>>>>>
>>>>> On Nov 30, 2020, at 14:19, Zoltán Borók-Nagy <
>>>>> borokna...@cloudera.com.INVALID> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for the replies. My take on the above questions is as follows:
>>>>>
>>>>>    - Should 'iceberg.catalog' be a required property?
>>>>>       - Yeah, I think it would be nice if this were required, to
>>>>>       avoid any implicit behavior
>>>>>
>>>>> Currently we have a Catalogs class to get/initialize/use the different
>>>>> Catalogs. At that time the decision was to use HadoopTables as a default
>>>>> catalog.
>>>>> It might be worthwhile to use the same class in Impala as well, so the
>>>>> behavior is consistent.
>>>>>
>>>>
>>>> Yeah, I think it'd be beneficial for us to use the Iceberg classes
>>>> whenever possible. The Catalogs class is very similar to what we have
>>>> currently in Impala.
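>>>>
>>>> For reference, loading through that class looks roughly like this (a
>>>> sketch against the iceberg-mr Catalogs API as I understand it; the
>>>> property values are examples):
>>>>
>>>> Configuration conf = new Configuration();
>>>> Properties props = new Properties();
>>>> props.setProperty("location", "/path/to/table");
>>>> // with no catalog configured, Catalogs falls back to HadoopTables
>>>> // and loads the table from the given location
>>>> Table table = Catalogs.loadTable(conf, props);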
>>>>
>>>>>
>>>>>    - 'hadoop.catalog' LOCATION and catalog_location
>>>>>       - In Impala we don't allow setting LOCATION for tables stored
>>>>>       in 'hadoop.catalog'. But Impala internally sets LOCATION to the
>>>>>       Iceberg table's actual location. We were also thinking about
>>>>>       using only the table LOCATION and setting it to the catalog
>>>>>       location, but we found that confusing as well.
>>>>>
>>>>> It could definitely work, but it is somewhat strange that we have an
>>>>> external table location set to an arbitrary path, and we have a different
>>>>> location generated by other configs. It would be nice to have the real
>>>>> location set in the external table location as well.
>>>>>
>>>>
>>>> Impala sets the real Iceberg table location for external tables. E.g.
>>>> if the user issues
>>>>
>>>> CREATE EXTERNAL TABLE my_hive_db.iceberg_table_hadoop_catalog
>>>> STORED AS ICEBERG
>>>> TBLPROPERTIES('iceberg.catalog'='hadoop.catalog',
>>>>               'iceberg.catalog_location'='/path/to/hadoop/catalog',
>>>>               'iceberg.table_identifier'='namespace1.namespace2.ice_t');
>>>>
>>>> If the end user had specified LOCATION, then Impala would have raised
>>>> an error. But the above DDL statement is correct, so Impala loads the
>>>> Iceberg table via the Iceberg API, then creates the HMS table and sets LOCATION
>>>> to the Iceberg table location (something like
>>>> /path/to/hadoop/catalog/namespace1/namespace2/ice_t).
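>>>>
>>>> Roughly, the load behind this is (a sketch using the Iceberg API; the
>>>> paths are the ones from the example above):
>>>>
>>>> HadoopCatalog catalog = new HadoopCatalog(conf, "/path/to/hadoop/catalog");
>>>> Table table = catalog.loadTable(
>>>>     TableIdentifier.parse("namespace1.namespace2.ice_t"));
>>>> // table.location() is what gets written into the HMS LOCATION field,
>>>> // e.g. /path/to/hadoop/catalog/namespace1/namespace2/ice_t
>>>> String location = table.location();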
>>>>
>>>>
>>>>> I like the flexibility of setting the table_identifier at the table
>>>>> level, which could help resolve naming conflicts. We might want to have
>>>>> this in the Iceberg Catalog implementation.
>>>>>
>>>>>
>>>>>    - 'iceberg.table_identifier' for HiveCatalog
>>>>>       - Yeah, it doesn't add much if we only allow using the current
>>>>>       HMS. I think it can only be useful if we allow external HMSes.
>>>>>    - Moving properties to SERDEPROPERTIES
>>>>>       - I see that these properties are used by the SerDe classes in
>>>>>       Hive, but I feel that these properties are just not about
>>>>>       serialization and deserialization. And as I see it, the current
>>>>>       SERDEPROPERTIES are things like 'field.delim', 'separatorChar',
>>>>>       'quoteChar', etc. So properties about table loading more
>>>>>       naturally belong in TBLPROPERTIES in my opinion.
>>>>>
>>>>> I have seen it used both ways for HBaseSerDe (even the wiki page uses
>>>>> both :) ). Since Impala prefers TBLPROPERTIES, if we start using a
>>>>> prefix to separate real Iceberg table properties from other properties,
>>>>> then we can keep them in TBLPROPERTIES.
>>>>>
>>>>
>>>> In the Google doc I also had a comment about prefixing Iceberg table
>>>> properties. We could use a prefix like 'iceberg.tblproperties.', and pass
>>>> every property with this prefix to the Iceberg table. Currently Impala
>>>> passes every table property to the Iceberg table.
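>>>>
>>>> As a sketch, the filtering could look like this (the prefix is the
>>>> proposal above, not an implemented name):
>>>>
>>>> static final String PREFIX = "iceberg.tblproperties.";
>>>>
>>>> static Map<String, String> toIcebergProps(Map<String, String> hmsProps) {
>>>>   Map<String, String> result = new HashMap<>();
>>>>   for (Map.Entry<String, String> e : hmsProps.entrySet()) {
>>>>     if (e.getKey().startsWith(PREFIX)) {
>>>>       // strip the prefix before handing the property to Iceberg
>>>>       result.put(e.getKey().substring(PREFIX.length()), e.getValue());
>>>>     }
>>>>   }
>>>>   return result;
>>>> }
>>>>
>>>> Ryan's alternative would flip the check: drop the reserved/prefixed keys
>>>> and pass everything else through.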
>>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>>     Zoltan
>>>>>
>>>>>
>>>>> On Mon, Nov 30, 2020 at 1:33 PM Peter Vary <pv...@cloudera.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Based on the discussion below I understand we have the following
>>>>>> kinds of properties:
>>>>>>
>>>>>>    1. Iceberg table properties - Engine independent, storage related
>>>>>>    parameters
>>>>>>    2. "How to get to" properties - I think these are mostly Hive table
>>>>>>    specific properties, since for Spark the Spark catalog configuration
>>>>>>    serves the same purpose. I think the best place for storing these
>>>>>>    would be the Hive SERDEPROPERTIES, as these describe the access
>>>>>>    information for the SerDe. Sidenote: I think we should decide
>>>>>>    whether we allow HiveCatalogs pointing to a different HMS; the
>>>>>>    'iceberg.table_identifier' property would make sense only if we
>>>>>>    allow having multiple catalogs.
>>>>>>    3. Query specific properties - These are engine specific and might
>>>>>>    be mapped to / even override the Iceberg table properties on the
>>>>>>    engine specific code paths, but currently these properties have
>>>>>>    independent names and are mapped on a case-by-case basis.
>>>>>>
>>>>>>
>>>>>> Based on this:
>>>>>>
>>>>>>    - Shall we move the "how to get to" properties to SERDEPROPERTIES?
>>>>>>    - Shall we define a prefix for setting Iceberg table properties
>>>>>>    from Hive queries, omitting other engine-specific properties?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Peter
>>>>>>
>>>>>>
>>>>>> On Nov 27, 2020, at 17:45, Mass Dosage <massdos...@gmail.com> wrote:
>>>>>>
>>>>>> I like these suggestions, comments inline below on the last round...
>>>>>>
>>>>>> On Thu, 26 Nov 2020 at 09:45, Zoltán Borók-Nagy <
>>>>>> borokna...@apache.org> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> The above aligns with what we did in Impala, i.e. we store
>>>>>>> information about table loading in HMS table properties. We are just a 
>>>>>>> bit
>>>>>>> more explicit about which catalog to use.
>>>>>>> We have the table property 'iceberg.catalog' to determine the catalog
>>>>>>> type; right now the supported values are 'hadoop.tables', 'hadoop.catalog',
>>>>>>> and 'hive.catalog'. Additional table properties can be set based on the
>>>>>>> catalog type.
>>>>>>>
>>>>>>> So, if the value of 'iceberg.catalog' is
>>>>>>>
>>>>>>
>>>>>> I'm all for renaming this; having "mr" in the property name is
>>>>>> confusing.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>    - hadoop.tables
>>>>>>>       - the table location is used to load the table
>>>>>>>
>>>>>> The only question I have is: should we have this as the default? I.e.
>>>>>> if you don't set a catalog, it will assume it's HadoopTables and use the
>>>>>> location? Or should we require this property to be present, to be
>>>>>> consistent and avoid any "magic"?
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>    - hadoop.catalog
>>>>>>>       - Required table property 'iceberg.catalog_location'
>>>>>>>       specifies the location of the hadoop catalog in the file system
>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>       specifies the table id. If it's not set, then
>>>>>>>       <database_name>.<table_name> is used as the table identifier
>>>>>>>
>>>>>>> I like this as it would allow you to use a different database and
>>>>>> table name in Hive as opposed to the Hadoop Catalog - at the moment they
>>>>>> have to match. The only thing here is that I think Hive requires a table
>>>>>> LOCATION to be set and it's then confusing as there are now two locations
>>>>>> on the table. I'm not sure whether in the Hive storage handler or SerDe
>>>>>> etc. we can get Hive to not require that and maybe even disallow it from
>>>>>> being set. That would probably be best in conjunction with this. Another
>>>>>> solution would be to not have the 'iceberg.catalog_location' property but
>>>>>> instead use the table LOCATION for this but that's a bit confusing from a
>>>>>> Hive point of view.
>>>>>>
>>>>>>
>>>>>>>    - hive.catalog
>>>>>>>       - Optional table property 'iceberg.table_identifier'
>>>>>>>       specifies the table id. If it's not set, then
>>>>>>>       <database_name>.<table_name> is used as the table identifier
>>>>>>>       - We have the assumption that the current Hive metastore
>>>>>>>       stores the table, i.e. we don't support external Hive
>>>>>>>       metastores currently
>>>>>>>
>>>>>> These sound fine for Hive catalog tables that are created outside of
>>>>>> the automatic Hive table creation (see
>>>>>> https://iceberg.apache.org/hive/ -> Using Hive Catalog); we'd just
>>>>>> need to document how you can create these yourself, and that one could
>>>>>> use a different Hive database and table etc.
>>>>>>
>>>>>>
>>>>>>> Independent of catalog implementations, we also have the table
>>>>>>> property 'iceberg.file_format' to specify the file format for the
>>>>>>> data files.
>>>>>>>
>>>>>>
>>>>>> OK, I don't think we need that for Hive?
>>>>>>
>>>>>>
>>>>>>> We haven't released it yet, so we are open to changes, but I think
>>>>>>> these properties are reasonable and it would be great if we could
>>>>>>> standardize the properties across engines that use HMS as the primary
>>>>>>> metastore of tables.
>>>>>>>
>>>>>>>
>>>>>> If others agree, I think we should create an issue where we document
>>>>>> the above changes so it's very clear what we're doing, and we can then
>>>>>> go and implement them and update the docs etc.
>>>>>>
>>>>>>
>>>>>>> Cheers,
>>>>>>>     Zoltan
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Yes, I think that is a good summary of the principles.
>>>>>>>>
>>>>>>>> #4 is correct because we provide some information that is
>>>>>>>> informational (Hive schema) or tracked only by the metastore 
>>>>>>>> (best-effort
>>>>>>>> current user). I also agree that it would be good to have a table
>>>>>>>> identifier in HMS table metadata when loading from an external table. 
>>>>>>>> That
>>>>>>>> gives us a way to handle name conflicts.
>>>>>>>>
>>>>>>>> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <jacq...@dremio.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Minor error, my last example should have been:
>>>>>>>>>
>>>>>>>>> db1.table1_etl_branch =>
>>>>>>>>> nessie.folder1.folder2.folder3.table1@etl_branch
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Jacques Nadeau
>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <jacq...@dremio.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I agree with Ryan on the core principles here. As I understand
>>>>>>>>>> them:
>>>>>>>>>>
>>>>>>>>>>    1. Iceberg metadata describes all properties of a table
>>>>>>>>>>    2. Hive table properties describe "how to get to" Iceberg
>>>>>>>>>>    metadata (which catalog + possibly ptr, path, token, etc.)
>>>>>>>>>>    3. There could be default "how to get to" information set at
>>>>>>>>>>    a global level
>>>>>>>>>>    4. Best-effort schema should be stored in the table properties
>>>>>>>>>>    in HMS. This should be done for information schema retrieval
>>>>>>>>>>    purposes within Hive but should be ignored during Hive/other
>>>>>>>>>>    tool execution.
>>>>>>>>>>
>>>>>>>>>> Is that a fair summary of your statements Ryan (except 4, which I
>>>>>>>>>> just added)?
>>>>>>>>>>
>>>>>>>>>> One comment I have on #2 is that for different catalogs and use
>>>>>>>>>> cases, I think it can be somewhat more complex: it would be
>>>>>>>>>> desirable for a table that initially existed without Hive, and was
>>>>>>>>>> later exposed in Hive, to support a ptr/path/token for how the
>>>>>>>>>> table is named externally. For example, in a Nessie context we
>>>>>>>>>> support arbitrary paths for an Iceberg table (such as
>>>>>>>>>> folder1.folder2.folder3.table1). If you then want to expose that
>>>>>>>>>> table to Hive, you might have this mapping for #2:
>>>>>>>>>>
>>>>>>>>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>>>>>>>>
>>>>>>>>>> Similarly, you might want to expose a particular branch version
>>>>>>>>>> of a table. So it might say:
>>>>>>>>>>
>>>>>>>>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>>>>>>>>
>>>>>>>>>> Just saying that the address to the table in the catalog could
>>>>>>>>>> itself have several properties. The key being that no matter what
>>>>>>>>>> those are, we should follow #1 and only store properties that are
>>>>>>>>>> about the ptr, not the content/metadata.
>>>>>>>>>>
>>>>>>>>>> Lastly, I believe #4 is the case but haven't tested it. Can
>>>>>>>>>> someone confirm that it is true? And that it is possible/not 
>>>>>>>>>> problematic?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Jacques Nadeau
>>>>>>>>>> CTO and Co-Founder, Dremio
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <
>>>>>>>>>> rb...@netflix.com.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks for working on this, Laszlo. I’ve been thinking about
>>>>>>>>>>> these problems as well, so this is a good time to have a
>>>>>>>>>>> discussion about Hive config.
>>>>>>>>>>>
>>>>>>>>>>> I think that Hive configuration should work mostly like other
>>>>>>>>>>> engines, where different configurations are used for different
>>>>>>>>>>> purposes. Different purposes means that there is not a global
>>>>>>>>>>> configuration priority. Hopefully, I can explain how we use the
>>>>>>>>>>> different config sources elsewhere to clarify.
>>>>>>>>>>>
>>>>>>>>>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a
>>>>>>>>>>> Hadoop Configuration, but it also has its own global
>>>>>>>>>>> configuration. There are also Iceberg table properties, and all
>>>>>>>>>>> of the various Hive properties if you’re tracking tables with a
>>>>>>>>>>> Hive MetaStore.
>>>>>>>>>>>
>>>>>>>>>>> The first step is to simplify where we can, so we effectively
>>>>>>>>>>> eliminate 2 sources of config:
>>>>>>>>>>>
>>>>>>>>>>>    - The Hadoop Configuration is only used to instantiate
>>>>>>>>>>>    Hadoop classes, like FileSystem. Iceberg should not use it for
>>>>>>>>>>>    any other config.
>>>>>>>>>>>    - Config in the Hive MetaStore is only used to identify that
>>>>>>>>>>>    a table is Iceberg and point to its metadata location. All
>>>>>>>>>>>    other config in HMS is informational. For example, the input
>>>>>>>>>>>    format is FileInputFormat so that non-Iceberg readers cannot
>>>>>>>>>>>    actually instantiate the format (it’s abstract) but it is
>>>>>>>>>>>    available so they also don’t fail trying to load the class.
>>>>>>>>>>>    Table-specific config should not be stored in table or serde
>>>>>>>>>>>    properties.
>>>>>>>>>>>
>>>>>>>>>>> That leaves Spark configuration and Iceberg table configuration.
>>>>>>>>>>>
>>>>>>>>>>> Iceberg differs from other tables because it is opinionated:
>>>>>>>>>>> data configuration should be maintained at the table level. This
>>>>>>>>>>> is cleaner for users because config is standardized across
>>>>>>>>>>> engines and in one place. And it also enables services that
>>>>>>>>>>> analyze a table and update its configuration, to tune options
>>>>>>>>>>> that users almost never touch, like row group or stripe size in
>>>>>>>>>>> the columnar formats. Iceberg table configuration is used to
>>>>>>>>>>> configure table-specific concerns and behavior.
>>>>>>>>>>>
>>>>>>>>>>> Spark configuration is used for engine-specific concerns and
>>>>>>>>>>> runtime overrides. A good example of an engine-specific concern
>>>>>>>>>>> is the catalogs that are available to load Iceberg tables. Spark
>>>>>>>>>>> has a way to load and configure catalog implementations, and
>>>>>>>>>>> Iceberg uses that for all catalog-level config. Runtime overrides
>>>>>>>>>>> are things like target split size. Iceberg has a table-level
>>>>>>>>>>> default split size in table properties, but this can be
>>>>>>>>>>> overridden by a Spark option for each table, as well as an option
>>>>>>>>>>> passed to the individual read. Note that these necessarily have
>>>>>>>>>>> different config names for how they are used: Iceberg uses
>>>>>>>>>>> read.split.target-size and the read-specific option is
>>>>>>>>>>> target-size.
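>>>>>>>>>>>
>>>>>>>>>>> As a sketch of that override chain (Spark Java API; the option
>>>>>>>>>>> name follows the description above, and the value is an example):
>>>>>>>>>>>
>>>>>>>>>>> // table property read.split.target-size supplies the default;
>>>>>>>>>>> // the per-read option overrides it for this read only
>>>>>>>>>>> Dataset<Row> df = spark.read()
>>>>>>>>>>>     .format("iceberg")
>>>>>>>>>>>     .option("target-size", "268435456")
>>>>>>>>>>>     .load("db.table");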
>>>>>>>>>>>
>>>>>>>>>>> Applying this to Hive is a little strange for a couple of
>>>>>>>>>>> reasons. First, Hive’s engine configuration *is* a Hadoop
>>>>>>>>>>> Configuration. As a result, I think the right place to store
>>>>>>>>>>> engine-specific config is there, including Iceberg catalogs,
>>>>>>>>>>> using a strategy similar to what Spark does: what external
>>>>>>>>>>> Iceberg catalogs are available and their configuration should
>>>>>>>>>>> come from the HiveConf.
>>>>>>>>>>>
>>>>>>>>>>> The second way Hive is strange is that Hive needs to use its own
>>>>>>>>>>> MetaStore to track Hive table concerns. The MetaStore may have
>>>>>>>>>>> tables created by an Iceberg HiveCatalog, and Hive also needs to
>>>>>>>>>>> be able to load tables from other Iceberg catalogs by creating
>>>>>>>>>>> table entries for them.
>>>>>>>>>>>
>>>>>>>>>>> Here’s how I think Hive should work (a rough code sketch follows
>>>>>>>>>>> the list):
>>>>>>>>>>>
>>>>>>>>>>>    - There should be a default HiveCatalog that uses the
>>>>>>>>>>>    current MetaStore URI, to be used for HiveCatalog tables
>>>>>>>>>>>    tracked in the MetaStore
>>>>>>>>>>>    - Other catalogs should be defined in HiveConf
>>>>>>>>>>>    - HMS table properties should be used to determine how to
>>>>>>>>>>>    load a table: using a Hadoop location, using the default
>>>>>>>>>>>    metastore catalog, or using an external Iceberg catalog
>>>>>>>>>>>       - If there is a metadata_location, then use the
>>>>>>>>>>>       HiveCatalog for this metastore (where it is tracked)
>>>>>>>>>>>       - If there is a catalog property, then load that catalog
>>>>>>>>>>>       and use it to load the table identifier, or maybe an
>>>>>>>>>>>       identifier from HMS table properties
>>>>>>>>>>>       - If there is no catalog or metadata_location, then use
>>>>>>>>>>>       HadoopTables to load the table location as an Iceberg table
>>>>>>>>>>>
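>>>>>>>>>>> A rough sketch of that resolution order (hypothetical helpers;
>>>>>>>>>>> assumes the HMS table properties are handed to us as a Map):
>>>>>>>>>>>
>>>>>>>>>>> Table resolve(Configuration conf, Map<String, String> hmsProps) {
>>>>>>>>>>>   if (hmsProps.containsKey("metadata_location")) {
>>>>>>>>>>>     // tracked by this metastore: use the default HiveCatalog
>>>>>>>>>>>     return defaultHiveCatalog(conf).loadTable(identifier(hmsProps));
>>>>>>>>>>>   } else if (hmsProps.containsKey("iceberg.catalog")) {
>>>>>>>>>>>     // external catalog defined in HiveConf
>>>>>>>>>>>     Catalog catalog = loadCatalog(conf, hmsProps.get("iceberg.catalog"));
>>>>>>>>>>>     return catalog.loadTable(identifier(hmsProps));
>>>>>>>>>>>   } else {
>>>>>>>>>>>     // no catalog info at all: load the location as a Hadoop table
>>>>>>>>>>>     return new HadoopTables(conf).load(hmsProps.get("location"));
>>>>>>>>>>>   }
>>>>>>>>>>> }
>>>>>>>>>>>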
>>>>>>>>>>> This would make it possible to access all types of Iceberg
>>>>>>>>>>> tables in the same query, and would match how Spark and Flink
>>>>>>>>>>> configure catalogs. Other than the configuration above, I don’t
>>>>>>>>>>> think that config in HMS should be used at all, like how the
>>>>>>>>>>> other engines work. Iceberg is the source of truth for table
>>>>>>>>>>> metadata, HMS stores how to load the Iceberg table, and HiveConf
>>>>>>>>>>> defines the catalogs (or runtime overrides).
>>>>>>>>>>>
>>>>>>>>>>> This isn’t quite how configuration works right now. Currently,
>>>>>>>>>>> the catalog is controlled by a HiveConf property,
>>>>>>>>>>> iceberg.mr.catalog. If that isn’t set, HadoopTables will be used
>>>>>>>>>>> to load table locations. If it is set, then that catalog will be
>>>>>>>>>>> used to load all tables by name. This makes it impossible to load
>>>>>>>>>>> tables from different catalogs at the same time. That’s why I
>>>>>>>>>>> think the Iceberg catalog for a table should be stored in HMS
>>>>>>>>>>> table properties.
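>>>>>>>>>>>
>>>>>>>>>>> For illustration, the current global behavior looks roughly like
>>>>>>>>>>> this (a sketch; iceberg.mr.catalog is the property described
>>>>>>>>>>> above, while the warehouse key is my recollection of iceberg-mr
>>>>>>>>>>> and the values are examples):
>>>>>>>>>>>
>>>>>>>>>>> HiveConf conf = new HiveConf();
>>>>>>>>>>> // one catalog is used for every Iceberg table in the query
>>>>>>>>>>> conf.set("iceberg.mr.catalog", "hadoop");
>>>>>>>>>>> conf.set("iceberg.mr.catalog.hadoop.warehouse.location",
>>>>>>>>>>>     "hdfs://namenode:8020/warehouse");
>>>>>>>>>>> // if iceberg.mr.catalog is unset, table locations are loaded
>>>>>>>>>>> // via HadoopTables instead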
>>>>>>>>>>>
>>>>>>>>>>> I should also explain the iceberg.hive.engine.enabled flag, but
>>>>>>>>>>> I think this is long enough for now.
>>>>>>>>>>>
>>>>>>>>>>> rb
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter <
>>>>>>>>>>> lpin...@cloudera.com.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> I would like to start a discussion about how we should handle
>>>>>>>>>>>> properties from various sources like Iceberg, Hive, or global
>>>>>>>>>>>> configuration. I've put together a short document
>>>>>>>>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>;
>>>>>>>>>>>> please have a look and let me know what you think.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Laszlo
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Ryan Blue
>>>>>>>>>>> Software Engineer
>>>>>>>>>>> Netflix
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Netflix
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>
