Hi,

The above aligns with what we did in Impala, i.e. we store information
about table loading in HMS table properties. We are just a bit more
explicit about which catalog to use.
We have the table property 'iceberg.catalog' to determine the catalog
type; right now the supported values are 'hadoop.tables',
'hadoop.catalog', and 'hive.catalog'. Additional table properties can be
set depending on the catalog type.

So, if the value of 'iceberg.catalog' is

   - hadoop.tables
      - the table location is used to load the table
   - hadoop.catalog
      - The required table property 'iceberg.catalog_location' specifies
      the location of the Hadoop catalog in the file system
      - The optional table property 'iceberg.table_identifier' specifies
      the table id. If it's not set, then <database_name>.<table_name> is
      used as the table identifier
   - hive.catalog
      - The optional table property 'iceberg.table_identifier' specifies
      the table id. If it's not set, then <database_name>.<table_name> is
      used as the table identifier
      - We assume the table is stored in the current Hive metastore, i.e.
      we don't support external Hive metastores currently

Independent of the catalog implementation, we also have the table property
'iceberg.file_format' to specify the file format of the data files.
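
To make this concrete, here is a rough Java sketch of how a reader could
resolve a table from these properties. The property names are the ones
described above; the Iceberg API calls are illustrative, and a real
implementation would cache catalog instances and handle errors:

   import java.util.Map;
   import org.apache.hadoop.conf.Configuration;
   import org.apache.iceberg.catalog.TableIdentifier;
   import org.apache.iceberg.hadoop.HadoopCatalog;
   import org.apache.iceberg.hadoop.HadoopTables;
   import org.apache.iceberg.hive.HiveCatalog;

   public class IcebergTableResolver {
     public static org.apache.iceberg.Table load(
         Configuration conf, org.apache.hadoop.hive.metastore.api.Table msTable) {
       Map<String, String> props = msTable.getParameters();
       String catalogType = props.get("iceberg.catalog");
       if (catalogType == null) {
         throw new IllegalArgumentException("Missing table property iceberg.catalog");
       }
       // Fall back to <database_name>.<table_name> if no identifier is set
       String tableId = props.getOrDefault("iceberg.table_identifier",
           msTable.getDbName() + "." + msTable.getTableName());

       switch (catalogType) {
         case "hadoop.tables":
           // The HMS table location points directly at the Iceberg table
           return new HadoopTables(conf).load(msTable.getSd().getLocation());
         case "hadoop.catalog": {
           // 'iceberg.catalog_location' is required for this catalog type
           String catalogLocation = props.get("iceberg.catalog_location");
           return new HadoopCatalog(conf, catalogLocation)
               .loadTable(TableIdentifier.parse(tableId));
         }
         case "hive.catalog":
           // Assumes the table is stored in the current Hive metastore
           return new HiveCatalog(conf).loadTable(TableIdentifier.parse(tableId));
         default:
           throw new IllegalArgumentException("Unknown catalog type: " + catalogType);
       }
     }
   }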

We haven't released it yet, so we are open to changes, but I think these
properties are reasonable and it would be great if we could standardize the
properties across engines that use HMS as the primary metastore of tables.

Cheers,
    Zoltan


On Thu, Nov 26, 2020 at 2:20 AM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Yes, I think that is a good summary of the principles.
>
> #4 is correct because some information we provide is purely informational
> (the Hive schema) or tracked only by the metastore (the best-effort current user).
> I also agree that it would be good to have a table identifier in HMS table
> metadata when loading from an external table. That gives us a way to handle
> name conflicts.
>
> On Wed, Nov 25, 2020 at 5:14 PM Jacques Nadeau <jacq...@dremio.com> wrote:
>
>> Minor correction: my last example should have been:
>>
>> db1.table1_etl_branch => nessie.folder1.folder2.folder3.table1@etl_branch
>>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>>
>> On Wed, Nov 25, 2020 at 4:56 PM Jacques Nadeau <jacq...@dremio.com>
>> wrote:
>>
>>> I agree with Ryan on the core principles here. As I understand them:
>>>
>>>    1. Iceberg metadata describes all properties of a table
>>>    2. Hive table properties describe "how to get to" Iceberg metadata
>>>    (which catalog + possibly ptr, path, token, etc)
>>>    3. There could be default "how to get to" information set at a
>>>    global level
>>>    4. A best-effort schema should be stored in the table properties in
>>>    HMS. This should be done for information schema retrieval purposes within
>>>    Hive but should be ignored during Hive/other tool execution.
>>>
>>> Is that a fair summary of your statements, Ryan (except #4, which I just
>>> added)?
>>>
>>> One comment I have on #2: for different catalogs and use cases, it can be
>>> somewhat more complex. It would be desirable for a table that initially
>>> existed without Hive, and was later exposed in Hive, to support a
>>> ptr/path/token describing how the table is named externally. For
>>> example, in a Nessie context we support arbitrary paths for an Iceberg
>>> table (such as folder1.folder2.folder3.table1). If you then want to expose
>>> that table to Hive, you might have this mapping for #2:
>>>
>>> db1.table1 => nessie:folder1.folder2.folder3.table1
>>>
>>> Similarly, you might want to expose a particular branch version of a
>>> table. So it might say:
>>>
>>> db1.table1_etl_branch => nessie.folder1@etl_branch
>>>
>>> Just saying that the address of the table in the catalog could itself
>>> have several properties. The key is that no matter what those are, we
>>> should follow #1 and only store properties that are about the ptr, not the
>>> content/metadata.
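>>>
>>> As a sketch, the HMS entry for that branch example might then carry only
>>> pointer properties, something like (these property names are hypothetical,
>>> just to illustrate the shape):
>>>
>>>    iceberg.catalog          = nessie
>>>    iceberg.table_identifier = folder1.folder2.folder3.table1
>>>    iceberg.table_branch     = etl_branch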
>>>
>>> Lastly, I believe #4 is the case but haven't tested it. Can someone
>>> confirm that it is true? And that it is possible/not problematic?
>>>
>>>
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>>
>>>
>>> On Wed, Nov 25, 2020 at 4:28 PM Ryan Blue <rb...@netflix.com.invalid>
>>> wrote:
>>>
>>>> Thanks for working on this, Laszlo. I’ve been thinking about these
>>>> problems as well, so this is a good time to have a discussion about Hive
>>>> config.
>>>>
>>>> I think that Hive configuration should work mostly like other engines,
>>>> where different configurations are used for different purposes. Different
>>>> purposes means that there is not a global configuration priority.
>>>> Hopefully, I can explain how we use the different config sources elsewhere
>>>> to clarify.
>>>>
>>>> Let’s take Spark as an example. Spark uses Hadoop, so it has a Hadoop
>>>> Configuration, but it also has its own global configuration. There are also
>>>> Iceberg table properties, and all of the various Hive properties if you’re
>>>> tracking tables with a Hive MetaStore.
>>>>
>>>> The first step is to simplify where we can, so we effectively eliminate
>>>> two sources of config:
>>>>
>>>>    - The Hadoop Configuration is only used to instantiate Hadoop
>>>>    classes, like FileSystem. Iceberg should not use it for any other
>>>>    config.
>>>>    - Config in the Hive MetaStore is only used to identify that a
>>>>    table is Iceberg and point to its metadata location. All other config in
>>>>    HMS is informational. For example, the input format is set to
>>>>    FileInputFormat so that non-Iceberg readers cannot actually instantiate
>>>>    the format (it's abstract), but the class is still available so they
>>>>    don't fail trying to load it. Table-specific config should not be
>>>>    stored in table or serde properties.
>>>>
>>>> That leaves Spark configuration and Iceberg table configuration.
>>>>
>>>> Iceberg differs from other tables because it is opinionated: data
>>>> configuration should be maintained at the table level. This is cleaner for
>>>> users because config is standardized across engines and in one place. And
>>>> it also enables services that analyze a table and update its configuration
>>>> to tune options that users almost never set themselves, like row group or
>>>> stripe size
>>>> in the columnar formats. Iceberg table configuration is used to configure
>>>> table-specific concerns and behavior.
>>>>
>>>> Spark configuration is used for engine-specific concerns, and runtime
>>>> overrides. A good example of an engine-specific concern is the catalogs
>>>> that are available to load Iceberg tables. Spark has a way to load and
>>>> configure catalog implementations and Iceberg uses that for all
>>>> catalog-level config. Runtime overrides are things like target split size.
>>>> Iceberg has a table-level default split size in table properties, but this
>>>> can be overridden by a Spark option for each table, as well as an option
>>>> passed to the individual read. Note that these necessarily have different
>>>> config names for how they are used: Iceberg uses read.split.target-size
>>>> and the read-specific option is target-size.
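>>>>
>>>> As a sketch of that layering in code (the option name is the one from
>>>> this paragraph; the table name and size are illustrative):
>>>>
>>>>    import org.apache.spark.sql.Dataset;
>>>>    import org.apache.spark.sql.Row;
>>>>    import org.apache.spark.sql.SparkSession;
>>>>
>>>>    SparkSession spark = SparkSession.builder().getOrCreate();
>>>>    // The table-level default comes from read.split.target-size in table
>>>>    // properties; this option overrides it for this read only.
>>>>    Dataset<Row> df = spark.read()
>>>>        .format("iceberg")
>>>>        .option("target-size", String.valueOf(64L * 1024 * 1024))
>>>>        .load("db.table");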
>>>>
>>>> Applying this to Hive is a little strange for a couple reasons. First,
>>>> Hive’s engine configuration *is* a Hadoop Configuration. As a result,
>>>> I think the right place to store engine-specific config is there, including
>>>> Iceberg catalog definitions, using a strategy similar to what Spark does:
>>>> which external Iceberg catalogs are available, and their configuration,
>>>> should come from the HiveConf.
>>>>
>>>> The second way Hive is strange is that Hive needs to use its own
>>>> MetaStore to track Hive table concerns. The MetaStore may have tables
>>>> created by an Iceberg HiveCatalog, and Hive also needs to be able to load
>>>> tables from other Iceberg catalogs by creating table entries for them.
>>>>
>>>> Here’s how I think Hive should work:
>>>>
>>>>    - There should be a default HiveCatalog, using the current MetaStore
>>>>    URI, for HiveCatalog tables tracked in the MetaStore
>>>>    - Other catalogs should be defined in HiveConf
>>>>    - HMS table properties should be used to determine how to load a
>>>>    table: using a Hadoop location, using the default metastore catalog, or
>>>>    using an external Iceberg catalog
>>>>       - If there is a metadata_location, then use the HiveCatalog for
>>>>       this metastore (where it is tracked)
>>>>       - If there is a catalog property, then load that catalog and use
>>>>       it to load the table identifier, or maybe an identifier from HMS
>>>>       table properties
>>>>       - If there is no catalog or metadata_location, then use
>>>>       HadoopTables to load the table location as an Iceberg table
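>>>>
>>>> In code, that resolution order would look roughly like this (a sketch
>>>> only; catalogFor() is a hypothetical helper standing in for whatever
>>>> HiveConf-based catalog loading we settle on, and the property names are
>>>> the ones from the list above):
>>>>
>>>>    Map<String, String> props = msTable.getParameters();
>>>>    if (props.containsKey("metadata_location")) {
>>>>      // Tracked by this metastore: load through the default HiveCatalog
>>>>      return defaultHiveCatalog.loadTable(identifier);
>>>>    } else if (props.containsKey("catalog")) {
>>>>      // External catalog defined in HiveConf
>>>>      Catalog catalog = catalogFor(conf, props.get("catalog"));
>>>>      return catalog.loadTable(identifier); // or an id from HMS properties
>>>>    } else {
>>>>      // No catalog and no metadata_location: use the table location
>>>>      return new HadoopTables(conf).load(msTable.getSd().getLocation());
>>>>    }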
>>>>
>>>> This would make it possible to access all types of Iceberg tables in
>>>> the same query, and would match how Spark and Flink configure catalogs.
>>>> Other than the configuration above, I don’t think that config in HMS should
>>>> be used at all, like how the other engines work. Iceberg is the source of
>>>> truth for table metadata, HMS stores how to load the Iceberg table, and
>>>> HiveConf defines the catalogs (or runtime overrides).
>>>>
>>>> This isn’t quite how configuration works right now. Currently, the
>>>> catalog is controlled by a HiveConf property, iceberg.mr.catalog. If
>>>> that isn’t set, HadoopTables will be used to load table locations. If it is
>>>> set, then that catalog will be used to load all tables by name. This makes
>>>> it impossible to load tables from different catalogs at the same time.
>>>> That’s why I think the Iceberg catalog for a table should be stored in HMS
>>>> table properties.
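>>>>
>>>> That is, today the choice is a single global knob (the value here is
>>>> only illustrative):
>>>>
>>>>    // One catalog is used for every table in the query
>>>>    conf.set("iceberg.mr.catalog", "hive.catalog");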
>>>>
>>>> I should also explain the iceberg.hive.engine.enabled flag, but I think
>>>> this is long enough for now.
>>>>
>>>> rb
>>>>
>>>> On Wed, Nov 25, 2020 at 1:41 AM Laszlo Pinter
>>>> <lpin...@cloudera.com.invalid> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I would like to start a discussion, how should we handle properties
>>>>> from various sources like Iceberg, Hive or global configuration. I've put
>>>>> together a short document
>>>>> <https://docs.google.com/document/d/1tyD7mGp_hh0dx9N_Ax9kj5INkg7Wzpj9XQ5t2-7AwNs/edit?usp=sharing>,
>>>>> please have a look and let me know what you think.
>>>>>
>>>>> Thanks,
>>>>> Laszlo
>>>>>
>>>>
>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Netflix
>>>>
>>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
