Chronos-LYH opened a new issue, #5631:
URL: https://github.com/apache/iceberg/issues/5631

   ### Feature Request / Improvement
   
   In ML scenarios, we may want Iceberg schemas to include additional 
information about a field. For example:
   For an integer field representing a feature, we need information indicating 
whether the feature is continuous or categorical:
   ```
   {"type": "continuous"}
   {"type": "categorical", "categories": ["US", "CA", "CN", ...]}
   ```
   For a list field representing multiple features, we may want information on 
some of the features:
   ```
   {
     "features": [
       {
         "index": 0,
         "name": "age",
         "type": "continuous"
       },
       {
         "index": 5,
         "name": "gender",
         "type": "categorical",
         "categories": [
           "male",
           "female"
         ]
       }
     ]
   }
   ```
   For a binary field representing a custom-encoded feature, we need 
information on the encoding:
   ```
    {"encoding": "feature_id_v1"}
    {"encoding": "feature_id_v2"}
   ```
   Spark has had a metadata field in its StructField class since Spark 1.2 
(https://issues.apache.org/jira/browse/SPARK-3569), so Spark DataFrames 
can hold ML-specific information such as the examples above.
   
   Following Spark's implementation, Iceberg could add a metadata field to the 
NestedField class. When declaring an Iceberg NestedField, users could provide a 
"metadata" argument with additional information about the field.
   ```
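   // Proposed API sketch: Metadata and this required(...) overload are part of
   // this proposal, not existing Iceberg classes or methods.
   Schema schema = new Schema(
           required(1, "feature1", Types.IntegerType.get(), null,
                   Metadata.fromJson(
                           "{\"type\": \"categorical\", \"categories\": [\"US\", \"CA\", \"CN\"]}")),
           required(2, "feature2", Types.IntegerType.get(), null,
                   Metadata.fromJson("{\"type\": \"continuous\"}"))
   );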
   ```
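
To make the shape of the proposal concrete, here is a minimal, self-contained sketch of a field that carries optional metadata. `FieldMetadata` and `SketchField` are hypothetical stand-ins invented for illustration (they are not Iceberg classes), and the field type is left out to keep the example short. It only assumes that metadata is an immutable key/value map and that fields declared without metadata get an empty one.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the proposed metadata container; not an Iceberg class.
final class FieldMetadata {
    static final FieldMetadata EMPTY = new FieldMetadata(Collections.emptyMap());

    private final Map<String, Object> entries;

    private FieldMetadata(Map<String, Object> entries) {
        // Shallow defensive copy: later changes to the caller's map do not leak in.
        this.entries = Collections.unmodifiableMap(new LinkedHashMap<>(entries));
    }

    static FieldMetadata of(Map<String, Object> entries) {
        return entries.isEmpty() ? EMPTY : new FieldMetadata(entries);
    }

    Object get(String key) {
        return entries.get(key);
    }

    boolean isEmpty() {
        return entries.isEmpty();
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof FieldMetadata && entries.equals(((FieldMetadata) o).entries);
    }

    @Override
    public int hashCode() {
        return entries.hashCode();
    }
}

// Hypothetical, simplified analogue of NestedField: id, name, doc, and metadata.
final class SketchField {
    final int id;
    final String name;
    final String doc;
    final FieldMetadata metadata;

    private SketchField(int id, String name, String doc, FieldMetadata metadata) {
        this.id = id;
        this.name = name;
        this.doc = doc;
        // A null argument is treated as "no metadata" so callers never see null.
        this.metadata = metadata == null ? FieldMetadata.EMPTY : metadata;
    }

    // New-style factory with a metadata argument.
    static SketchField required(int id, String name, String doc, FieldMetadata metadata) {
        return new SketchField(id, name, doc, metadata);
    }

    // Existing-style factory: fields declared without metadata keep working.
    static SketchField required(int id, String name) {
        return new SketchField(id, name, null, FieldMetadata.EMPTY);
    }
}
```

Keeping a metadata-free factory alongside the new one is one way existing schema declarations could stay source-compatible.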
   Iceberg and Spark field metadata should be convertible to each other, 
so that field metadata in Iceberg can be passed to Spark DataFrames, and 
DataFrames can preserve field metadata when saved in Iceberg format.
   
   See also:
   
https://docs.google.com/document/d/1RGJgVJhCebnilpL15ODcq0EWBeVjl9ltoHUvosWodPg/edit#
   
   ### Query engine
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

