tengqm commented on code in PR #7009:
URL: https://github.com/apache/gravitino/pull/7009#discussion_r2051299179
##########
docs/manage-fileset-metadata-using-gravitino.md:
##########
@@ -315,16 +315,52 @@ Currently, Gravitino supports two **types** of filesets:
specified as `EXTERNAL`, the files of the fileset will **not** be deleted
when
the fileset is dropped.
-**storageLocation**
+:::note
+If the locations of the manged fileset do not exist, Gravitino will
create/delete the locations when the fileset is created/deleted.
+Unless the catalog property `disable-filesystem-ops` is set to true or the
location contains a
[placeholder](./manage-fileset-metadata-using-gravitino.md#placeholder).
+:::
+
+#### storageLocation
The `storageLocation` is the physical location of the fileset. Users can
specify this location
when creating a fileset, or follow the rules of the catalog/schema location if
not specified.
+The value of `storageLocation` depends on the configuration settings of the
catalog:
+- If this is a local fileset catalog, the `storageLocation` should be in the
format of `file:///path/to/fileset`.
+- If this is a HDFS fileset catalog, the `storageLocation` should be in the
format of `hdfs://namenode:port/path/to/fileset`.
Review Comment:
```suggestion
- For an HDFS fileset catalog, the `storageLocation` should be in the format of `hdfs://namenode:port/path/to/fileset`.
```
##########
docs/manage-fileset-metadata-using-gravitino.md:
##########
@@ -429,34 +465,198 @@
catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema
</TabItem>
</Tabs>
-The value of `storageLocation` depends on the configuration settings of the
catalog:
-- If this is a local fileset catalog, the `storageLocation` should be in the
format of `file:///path/to/fileset`.
-- If this is a HDFS fileset catalog, the `storageLocation` should be in the
format of `hdfs://namenode:port/path/to/fileset`.
+#### storageLocations
+You can also create a fileset with multiple storage locations. The
`storageLocations` is a map of location name to storage location.
+The generation rules of each location follow the generation rules of a single
location.
+The following is an example of creating a fileset with multiple storage
locations:
-For a `MANAGED` fileset, the storage location is:
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
-1. The one specified by the user during the fileset creation, and the
placeholder will be replaced by the
- corresponding fileset property value.
-2. When the catalog property `location` is specified but the schema property
`location` isn't specified, the storage location is:
- 1. `catalog location/schema name/fileset name` if `catalog location` does
not contain any placeholder.
- 2. `catalog location` - placeholders in the catalog location will be
replaced by the corresponding fileset property value.
+```shell
+# create a catalog first
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "test_catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "filesystem-providers": "builtin-local,builtin-hdfs,s3,gcs",
+ "location-l1":
"file:///{{catalog}}/{{schema}}/workspace_{{project}}/{{user}}",
+ "location-l2":
"hdfs:///{{catalog}}/{{schema}}/workspace_{{project}}/{{user}}"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
-3. When the catalog property `location` isn't specified but the schema
property `location` is specified,
- the storage location is:
- 1. `schema location/fileset name` if `schema location` does not contain any
placeholder.
- 2. `schema location` - placeholders in the schema location will be replaced
by the corresponding fileset property value.
-
-4. When both the catalog property `location` and the schema property
`location` are specified, the storage
- location is:
- 1. `schema location/fileset name` if `schema location` does not contain any
placeholder.
- 2. `schema location` - placeholders in the schema location will be replaced
by the corresponding fileset property value.
+# create a schema under the catalog
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "test_schema",
+ "comment": "comment",
+ "properties": {
+ "location-l3":
"s3a://myBucket/{{catalog}}/{{schema}}/workspace_{{project}}/{{user}}"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas
-5. When both the catalog property `location` and schema property `location`
isn't specified, the user
- should specify the `storageLocation` in the fileset creation.
+# create a fileset by placeholders
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "example_fileset",
+ "comment": "This is an example fileset",
+ "type": "MANAGED",
+ "storageLocations": {
+ "l4": "gs://myBucket/{{catalog}}/{{schema}}/workspace_{{project}}/{{user}}"
+ },
+ "properties": {
+ "placeholder-project": "test_project",
+ "placeholder-user": "test_user",
+ "default-location-name": "l1"
+ }
+}'
http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets
-For `EXTERNAL` fileset, users should specify `storageLocation` during the
fileset creation,
-otherwise, Gravitino will throw an exception. If the `storageLocation`
contains placeholders, the
-placeholder will be replaced by the corresponding fileset property value.
+# the fileset will be created with 4 storage locations:
+{
+ "name": "example_fileset",
+ "comment": "This is an example fileset",
+ "type": "MANAGED",
+ "storageLocation": null,
+ "storageLocations": {
Review Comment:
Is `storageLocation` kept only for backward compatibility?
Why do we have both singular and plural forms of the same property?
Can we unify them into a single `storageLocations` map?
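To make the question concrete, here is a minimal sketch of what folding the singular field into the map could look like. It is not taken from the Gravitino code base: the function name `unify_locations` is hypothetical, and reusing `unknown` as the name for the singular location is an assumption borrowed from the schema `location` row in this PR.

```python
def unify_locations(storage_location=None, storage_locations=None,
                    default_name="unknown"):
    """Fold a singular storageLocation into a storageLocations map.

    Hypothetical helper for discussion only. `default_name` is an
    assumption: this PR documents `unknown` as the location name of the
    schema `location` property, and the same name is reused here.
    """
    merged = dict(storage_locations or {})
    if storage_location is not None:
        existing = merged.get(default_name)
        if existing is not None and existing != storage_location:
            # Both forms were given and they disagree, so refuse to guess.
            raise ValueError(
                f"conflicting values for location name {default_name!r}")
        merged[default_name] = storage_location
    return merged
```

With something like this at the API boundary, a response would only need to carry the map, and `"storageLocation": null` would not have to appear at all.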
##########
docs/hadoop-catalog.md:
##########
@@ -125,14 +132,15 @@ Refer to [Schema
operation](./manage-fileset-metadata-using-gravitino.md#schema-
### Fileset properties
-| Property name | Description | Default value | Required | Immutable | Since Version |
-|---------------------------------------|--------------------------------------------------------------------------------------------------------|--------------------------|----------|-----------|------------------|
-| `authentication.impersonation-enable` | Whether to enable impersonation for the Hadoop catalog fileset. | The parent(schema) value | No | Yes | 0.6.0-incubating |
-| `authentication.type` | The type of authentication for Hadoop catalog fileset, currently we only support `kerberos`, `simple`. | The parent(schema) value | No | No | 0.6.0-incubating |
-| `authentication.kerberos.principal` | The principal of the Kerberos authentication for the fileset. | The parent(schema) value | No | No | 0.6.0-incubating |
-| `authentication.kerberos.keytab-uri` | The URI of The keytab for the Kerberos authentication for the fileset. | The parent(schema) value | No | No | 0.6.0-incubating |
-| `credential-providers` | The credential provider types, separated by comma. | (none) | No | No | 0.8.0-incubating |
-| `placeholder-` | Properties that start with `placeholder-` are used to replace placeholders in the location. | (none) | No | Yes | 0.9.0-incubating |
+| Property name | Description | Default value | Required | Immutable | Since Version |
+|---------------------------------------|----------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------|--------------------------------------------|-----------|------------------|
+| `authentication.impersonation-enable` | Whether to enable impersonation for the Hadoop catalog fileset. | The parent(schema) value | No | Yes | 0.6.0-incubating |
+| `authentication.type` | The type of authentication for Hadoop catalog fileset, currently we only support `kerberos`, `simple`. | The parent(schema) value | No | No | 0.6.0-incubating |
+| `authentication.kerberos.principal` | The principal of the Kerberos authentication for the fileset. | The parent(schema) value | No | No | 0.6.0-incubating |
+| `authentication.kerberos.keytab-uri` | The URI of The keytab for the Kerberos authentication for the fileset. | The parent(schema) value | No | No | 0.6.0-incubating |
+| `credential-providers` | The credential provider types, separated by comma. | (none) | No | No | 0.8.0-incubating |
+| `placeholder-` | Properties that start with `placeholder-` are used to replace placeholders in the location. | (none) | No | Yes | 0.9.0-incubating |
+| `default-location-name` | The name of the default location of the fileset, mainly used for GVFS operations without specifying a location name. | When the fileset has only one location, its location name will be automatically selected as the default value. | Yes, if the fileset has multiple locations | Yes | 0.9.0-incubating |
Review Comment:
This is an example of why Markdown tables are not suitable for wide content.
Every time we add a new row, the whole table changes,
and it is hard to tell what the real changes are.
##########
docs/hadoop-catalog.md:
##########
@@ -104,19 +105,25 @@ The Hadoop catalog supports creating, updating, deleting,
and listing schema.
### Schema properties
-| Property name | Description | Default value | Required | Since Version |
-|---------------------------------------|----------------------------------------------------------------------------------------------------------------|---------------------------|----------|------------------|
-| `location` | The storage location managed by Hadoop schema. | (none) | No | 0.5.0 |
-| `authentication.impersonation-enable` | Whether to enable impersonation for this schema of the Hadoop catalog. | The parent(catalog) value | No | 0.6.0-incubating |
-| `authentication.type` | The type of authentication for this schema of Hadoop catalog , currently we only support `kerberos`, `simple`. | The parent(catalog) value | No | 0.6.0-incubating |
-| `authentication.kerberos.principal` | The principal of the Kerberos authentication for this schema. | The parent(catalog) value | No | 0.6.0-incubating |
-| `authentication.kerberos.keytab-uri` | The URI of The keytab for the Kerberos authentication for this schema. | The parent(catalog) value | No | 0.6.0-incubating |
-| `credential-providers` | The credential provider types, separated by comma. | (none) | No | 0.8.0-incubating |
+| Property name | Description | Default value | Required | Since Version |
+|---------------------------------------|-----------------------------------------------------------------------------------------------------------------------------|---------------------------|----------|------------------|
+| `location` | The storage location managed by Hadoop schema. It's location name is `unknown`. | (none) | No | 0.5.0 |
+| `location-` | The property prefix. User can use `location-{name}=location` to set multiple locations with different names for the schema. | (none) | No | 0.9.0-incubating |
+| `authentication.impersonation-enable` | Whether to enable impersonation for this schema of the Hadoop catalog. | The parent(catalog) value | No | 0.6.0-incubating |
+| `authentication.type` | The type of authentication for this schema of Hadoop catalog , currently we only support `kerberos`, `simple`. | The parent(catalog) value | No | 0.6.0-incubating |
+| `authentication.kerberos.principal` | The principal of the Kerberos authentication for this schema. | The parent(catalog) value | No | 0.6.0-incubating |
+| `authentication.kerberos.keytab-uri` | The URI of The keytab for the Kerberos authentication for this schema. | The parent(catalog) value | No | 0.6.0-incubating |
+| `credential-providers` | The credential provider types, separated by comma. | (none) | No | 0.8.0-incubating |
### Schema operations
Refer to [Schema
operation](./manage-fileset-metadata-using-gravitino.md#schema-operations) for
more details.
+:::note
+If the locations of the schema do not exist, Gravitino will create/delete the
locations when the schema is created/deleted.
+Unless the catalog property `disable-filesystem-ops` is set to true or the
location contains a
[placeholder](./manage-fileset-metadata-using-gravitino.md#placeholder).
Review Comment:
Incomplete sentence: the line starting with "Unless" has no main clause.
Consider joining it to the previous sentence.
##########
docs/manage-fileset-metadata-using-gravitino.md:
##########
@@ -315,16 +315,52 @@ Currently, Gravitino supports two **types** of filesets:
specified as `EXTERNAL`, the files of the fileset will **not** be deleted
when
the fileset is dropped.
-**storageLocation**
+:::note
+If the locations of the manged fileset do not exist, Gravitino will
create/delete the locations when the fileset is created/deleted.
+Unless the catalog property `disable-filesystem-ops` is set to true or the
location contains a
[placeholder](./manage-fileset-metadata-using-gravitino.md#placeholder).
+:::
+
+#### storageLocation
The `storageLocation` is the physical location of the fileset. Users can
specify this location
when creating a fileset, or follow the rules of the catalog/schema location if
not specified.
+The value of `storageLocation` depends on the configuration settings of the
catalog:
+- If this is a local fileset catalog, the `storageLocation` should be in the
format of `file:///path/to/fileset`.
+- If this is a HDFS fileset catalog, the `storageLocation` should be in the
format of `hdfs://namenode:port/path/to/fileset`.
+
+For a `MANAGED` fileset, the storage location is:
+
+1. The one specified by the user during the fileset creation, and the
[placeholder](#placeholder) will be replaced by the
+ corresponding fileset property value.
+2. When the catalog property `location` is specified but the schema property
`location` isn't specified, the storage location is:
+ 1. `catalog location/schema name/fileset name` if `catalog location` does
not contain any placeholder.
+ 2. `catalog location` - placeholders in the catalog location will be
replaced by the corresponding fileset property value.
+
+3. When the catalog property `location` isn't specified but the schema
property `location` is specified,
+ the storage location is:
+ 1. `schema location/fileset name` if `schema location` does not contain any
placeholder.
+ 2. `schema location` - placeholders in the schema location will be replaced
by the corresponding fileset property value.
+
+4. When both the catalog property `location` and the schema property
`location` are specified, the storage
+ location is:
+ 1. `schema location/fileset name` if `schema location` does not contain any
placeholder.
+ 2. `schema location` - placeholders in the schema location will be replaced
by the corresponding fileset property value.
+
+5. When both the catalog property `location` and schema property `location`
isn't specified, the user
+ should specify the `storageLocation` in the fileset creation.
Review Comment:
The logic is not very complicated, but the description is difficult to parse.
Based on the description above, the logic is:
- If the `storageLocation` parameter is provided in the fileset creation request,
  we use it, with placeholders (if any) substituted using fileset property values.
- Otherwise, if `location` is specified in the schema, we use `<schema location>/<fileset name>`,
  with placeholders (if any) in `<schema location>` substituted using fileset property values.
- Otherwise, if `location` is specified in the catalog, we use
  `<catalog location>/<schema name>/<fileset name>`,
  with placeholders (if any) in `<catalog location>` substituted using fileset property values.
- Otherwise, we throw an error for missing storage location information.
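
The fallback order could be sketched roughly as follows. This is only an illustration of my reading of the docs, not the actual implementation: the function names are hypothetical, and treating `{{catalog}}`, `{{schema}}`, and `{{fileset}}` as built-in placeholders while resolving the rest from `placeholder-<name>` fileset properties is an assumption based on the examples in this PR.

```python
import re

# Matches placeholders of the form {{name}}, as used in the doc examples.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def substitute(location, props, builtins):
    """Replace each {{name}} using built-in values or `placeholder-name` properties."""
    def repl(match):
        name = match.group(1)
        if name in builtins:
            return builtins[name]
        key = "placeholder-" + name
        if key not in props:
            raise ValueError(f"no value for placeholder {name!r}")
        return props[key]
    return PLACEHOLDER.sub(repl, location)


def resolve_storage_location(catalog, schema, fileset,
                             storage_location=None,
                             schema_location=None,
                             catalog_location=None,
                             props=None):
    """Pick the storage location in the order: request, schema, catalog, error."""
    props = props or {}
    # Assumption: these three names are always available as placeholders.
    builtins = {"catalog": catalog, "schema": schema, "fileset": fileset}
    if storage_location:
        base = storage_location
    elif schema_location:
        base = f"{schema_location.rstrip('/')}/{fileset}"
    elif catalog_location:
        base = f"{catalog_location.rstrip('/')}/{schema}/{fileset}"
    else:
        raise ValueError("missing storage location information")
    return substitute(base, props, builtins)
```

If the docs described the rules as a single ordered list like this, the five numbered cases would collapse into four, and the "both specified" case would not need its own entry.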
##########
docs/manage-fileset-metadata-using-gravitino.md:
##########
@@ -429,34 +465,198 @@
catalog.as_fileset_catalog().create_fileset(ident=NameIdentifier.of("test_schema
</TabItem>
</Tabs>
-The value of `storageLocation` depends on the configuration settings of the
catalog:
-- If this is a local fileset catalog, the `storageLocation` should be in the
format of `file:///path/to/fileset`.
-- If this is a HDFS fileset catalog, the `storageLocation` should be in the
format of `hdfs://namenode:port/path/to/fileset`.
+#### storageLocations
+You can also create a fileset with multiple storage locations. The
`storageLocations` is a map of location name to storage location.
+The generation rules of each location follow the generation rules of a single
location.
+The following is an example of creating a fileset with multiple storage
locations:
-For a `MANAGED` fileset, the storage location is:
+<Tabs groupId="language" queryString>
+<TabItem value="shell" label="Shell">
-1. The one specified by the user during the fileset creation, and the
placeholder will be replaced by the
- corresponding fileset property value.
-2. When the catalog property `location` is specified but the schema property
`location` isn't specified, the storage location is:
- 1. `catalog location/schema name/fileset name` if `catalog location` does
not contain any placeholder.
- 2. `catalog location` - placeholders in the catalog location will be
replaced by the corresponding fileset property value.
+```shell
+# create a catalog first
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "test_catalog",
+ "type": "FILESET",
+ "comment": "comment",
+ "provider": "hadoop",
+ "properties": {
+ "filesystem-providers": "builtin-local,builtin-hdfs,s3,gcs",
+ "location-l1":
"file:///{{catalog}}/{{schema}}/workspace_{{project}}/{{user}}",
+ "location-l2":
"hdfs:///{{catalog}}/{{schema}}/workspace_{{project}}/{{user}}"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
-3. When the catalog property `location` isn't specified but the schema
property `location` is specified,
- the storage location is:
- 1. `schema location/fileset name` if `schema location` does not contain any
placeholder.
- 2. `schema location` - placeholders in the schema location will be replaced
by the corresponding fileset property value.
-
-4. When both the catalog property `location` and the schema property
`location` are specified, the storage
- location is:
- 1. `schema location/fileset name` if `schema location` does not contain any
placeholder.
- 2. `schema location` - placeholders in the schema location will be replaced
by the corresponding fileset property value.
+# create a schema under the catalog
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "test_schema",
+ "comment": "comment",
+ "properties": {
+ "location-l3":
"s3a://myBucket/{{catalog}}/{{schema}}/workspace_{{project}}/{{user}}"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas
-5. When both the catalog property `location` and schema property `location`
isn't specified, the user
- should specify the `storageLocation` in the fileset creation.
+# create a fileset by placeholders
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+-H "Content-Type: application/json" -d '{
+ "name": "example_fileset",
+ "comment": "This is an example fileset",
+ "type": "MANAGED",
+ "storageLocations": {
+ "l4": "gs://myBucket/{{catalog}}/{{schema}}/workspace_{{project}}/{{user}}"
+ },
+ "properties": {
+ "placeholder-project": "test_project",
+ "placeholder-user": "test_user",
+ "default-location-name": "l1"
+ }
+}'
http://localhost:8090/api/metalakes/metalake/catalogs/test_catalog/schemas/test_schema/filesets
-For `EXTERNAL` fileset, users should specify `storageLocation` during the
fileset creation,
-otherwise, Gravitino will throw an exception. If the `storageLocation`
contains placeholders, the
-placeholder will be replaced by the corresponding fileset property value.
+# the fileset will be created with 4 storage locations:
+{
+ "name": "example_fileset",
+ "comment": "This is an example fileset",
+ "type": "MANAGED",
+ "storageLocation": null,
+ "storageLocations": {
Review Comment:
Another question ...
Is `storageLocation` a required property?
Why do we need to explicitly set a property that might be defaulted to null?
##########
docs/manage-fileset-metadata-using-gravitino.md:
##########
@@ -315,16 +315,52 @@ Currently, Gravitino supports two **types** of filesets:
specified as `EXTERNAL`, the files of the fileset will **not** be deleted
when
the fileset is dropped.
-**storageLocation**
+:::note
+If the locations of the manged fileset do not exist, Gravitino will
create/delete the locations when the fileset is created/deleted.
+Unless the catalog property `disable-filesystem-ops` is set to true or the
location contains a
[placeholder](./manage-fileset-metadata-using-gravitino.md#placeholder).
+:::
+
+#### storageLocation
The `storageLocation` is the physical location of the fileset. Users can
specify this location
when creating a fileset, or follow the rules of the catalog/schema location if
not specified.
+The value of `storageLocation` depends on the configuration settings of the
catalog:
+- If this is a local fileset catalog, the `storageLocation` should be in the
format of `file:///path/to/fileset`.
Review Comment:
```suggestion
- For a local fileset catalog, the `storageLocation` should be in the format of `file:///path/to/fileset`.
```