momo-jun commented on code in PR #18434:
URL: https://github.com/apache/pulsar/pull/18434#discussion_r1042922337

##########
site2/docs/schema-understand.md:
##########
@@ -48,92 +46,227 @@ Pulsar supports various schema types, which are mainly divided into two categori
 The following table outlines the primitive types that Pulsar schema supports, and the conversions between **schema types** and **language-specific primitive types**.

-| Primitive Type | Description | Java Type| Python Type | Go Type |
-|---|---|---|---|---|
-| `BOOLEAN` | A binary value | boolean | bool | bool |
-| `INT8` | A 8-bit signed integer | int | | int8 |
-| `INT16` | A 16-bit signed integer | int | | int16 |
-| `INT32` | A 32-bit signed integer | int | | int32 |
-| `INT64` | A 64-bit signed integer | int | | int64 |
-| `FLOAT` | A single precision (32-bit) IEEE 754 floating-point number | float | float | float32 |
-| `DOUBLE` | A double-precision (64-bit) IEEE 754 floating-point number | double | float | float64|
-| `BYTES` | A sequence of 8-bit unsigned bytes | byte[], ByteBuffer, ByteBuf | bytes | []byte |
-| `STRING` | A Unicode character sequence | string | str | string|
-| `TIMESTAMP` (`DATE`, `TIME`) | A logic type represents a specific instant in time with millisecond precision. <br />It stores the number of milliseconds since `January 1, 1970, 00:00:00 GMT` as an `INT64` value | java.sql.Timestamp (java.sql.Time, java.util.Date) | | |
-| INSTANT | A single instantaneous point on the time-line with nanoseconds precision| java.time.Instant | | |
-| LOCAL_DATE | An immutable date-time object that represents a date, often viewed as year-month-day| java.time.LocalDate | | |
-| LOCAL_TIME | An immutable date-time object that represents a time, often viewed as hour-minute-second. Time is represented to nanosecond precision.| java.time.LocalDateTime | |
-| LOCAL_DATE_TIME | An immutable date-time object that represents a date-time, often viewed as year-month-day-hour-minute-second | java.time.LocalTime | |
+| Primitive Type | Description | Java Type| Python Type | Go Type | C++ Type | C# Type|
+|---|---|---|---|---|---|---|
+| `BOOLEAN` | A binary value. | boolean | bool | bool | bool | bool |
+| `INT8` | A 8-bit signed integer. | int | int | int8 | int8_t | byte |
+| `INT16` | A 16-bit signed integer. | int | int | int16 | int16_t | short |
+| `INT32` | A 32-bit signed integer. | int | int | int32 | int32_t | int |
+| `INT64` | A 64-bit signed integer. | int | int | int64 | int64_t | long |
+| `FLOAT` | A single precision (32-bit) IEEE 754 floating-point number. | float | float | float32 | float | float |
+| `DOUBLE` | A double-precision (64-bit) IEEE 754 floating-point number. | double | double | float64| double | double |
+| `BYTES` | A sequence of 8-bit unsigned bytes. | byte[], ByteBuffer, ByteBuf | bytes | []byte | void * | byte[], `ReadOnlySequence<byte>` |
+| `STRING` | An Unicode character sequence. | string | str | string| std::string | string |
+| `TIMESTAMP` (`DATE`, `TIME`) | A logic type represents a specific instant in time with millisecond precision. <br />It stores the number of milliseconds since `January 1, 1970, 00:00:00 GMT` as an `INT64` value. | java.sql.Timestamp (java.sql.Time, java.util.Date) | N/A | N/A | N/A | DateTime,TimeSpan |
+| `INSTANT`| A single instantaneous point on the timeline with nanoseconds precision. | java.time.Instant | N/A | N/A | N/A | N/A |
+| `LOCAL_DATE` | An immutable date-time object that represents a date, often viewed as year-month-day. | java.time.LocalDate | N/A | N/A | N/A | N/A |
+| `LOCAL_TIME` | An immutable date-time object that represents a time, often viewed as hour-minute-second. Time is represented to nanosecond precision. | java.time.LocalDateTime | N/A | N/A | N/A | N/A |
+| LOCAL_DATE_TIME | An immutable date-time object that represents a date-time, often viewed as year-month-day-hour-minute-second. | java.time.LocalTime | N/A | N/A | N/A | N/A |
+
+:::note

-For primitive types, Pulsar does not store any schema data in `SchemaInfo`. The `type` in `SchemaInfo` determines how to serialize and deserialize the data.
+Pulsar does not store any schema data in `SchemaInfo` for primitive types. Some of the primitive schema implementations can use the `properties` parameter to store implementation-specific tunable settings. For example, a string schema can use `properties` to store the encoding charset to serialize and deserialize strings.

-Some of the primitive schema implementations can use `properties` to store implementation-specific tunable settings. For example, a `string` schema can use `properties` to store the encoding charset to serialize and deserialize strings.
+:::

-For more instructions, see [Construct a string schema](schema-get-started.md#construct-a-string-schema).
+For more instructions and examples, see [Construct a string schema](schema-get-started.md#string).

 ### Complex type

-Currently, Pulsar supports the following complex types:
+The following table outlines the complex types that Pulsar schema supports:

 | Complex Type | Description |
 |---|---|
-| `KeyValue` | Represents a complex type of a key/value pair. |
-| `Struct` | Handles structured data. It supports `AvroBaseStructSchema` and `ProtobufNativeSchema`. |
+| `Keyvalue` | Represents a complex key/value pair. |
+| `Struct` | Represents structured data, including `AvroBaseStructSchema`, `ProtobufNativeSchema` and `Schema.NATIVE_AVRO`. |

 #### `KeyValue` schema

-`KeyValue` schema helps applications define schemas for both key and value. Pulsar stores the `SchemaInfo` of key schema and the value schema together.
+`KeyValue` schema helps applications define schemas for both key and value. Pulsar stores the `SchemaInfo` of the key schema and the value schema together.

-You can choose the encoding type when constructing the key/value schema.:
+Pulsar provides the following methods to encode a **single** key/value pair in a message:

 * `INLINE` - Key/value pairs are encoded together in the message payload.
-* `SEPARATED` - see [Construct a key/value schema](schema-get-started.md#construct-a-keyvalue-schema).
+* `SEPARATED` - The Key is stored as a message key, while the value is stored as the message payload. See [Construct a key/value schema](schema-get-started.md#keyvalue) for more details.

 #### `Struct` schema

-`Struct` schema supports `AvroBaseStructSchema` and `ProtobufNativeSchema`.
+The following table outlines the `struct` types that Pulsar schema supports:

 |Type|Description|
 ---|---|
-`AvroBaseStructSchema`|Pulsar uses [Avro Specification](http://avro.apache.org/docs/current/spec.html) to declare the schema definition for `AvroBaseStructSchema`, which supports `AvroSchema`, `JsonSchema`, and `ProtobufSchema`. <br /><br />This allows Pulsar:<br />- to use the same tools to manage schema definitions<br />- to use different serialization or deserialization methods to handle data|
-`ProtobufNativeSchema`|`ProtobufNativeSchema` is based on protobuf native Descriptor. <br /><br />This allows Pulsar:<br />- to use native protobuf-v3 to serialize or deserialize data<br />- to use `AutoConsume` to deserialize data.
+`AvroBaseStructSchema`|Pulsar uses [Avro Specification](http://avro.apache.org/docs/current/spec.html) to declare the schema definition for `AvroBaseStructSchema`, which supports `AvroSchema`, `JsonSchema`, and `ProtobufSchema`. <br /><br />This allows Pulsar to:<br />- use the same tools to manage schema definitions.<br />- use different serialization or deserialization methods to handle data. |
+`ProtobufNativeSchema`|`ProtobufNativeSchema` is based on protobuf native Descriptor. <br /><br />This allows Pulsar to:<br />- use native protobuf-v3 to serialize or deserialize data<br />- use `AutoConsume` to deserialize data.|
+`Schema.NATIVE_AVRO` | `Schema.NATIVE_AVRO` is used to wrap a native Avro schema type `org.apache.avro.Schema`. The result is a schema instance that accepts a serialized Avro payload without validating it against the wrapped Avro schema. <br /><br />When you migrate or ingest event or messaging data from external systems (such as Kafka and Cassandra), the data is often already serialized in Avro format. The applications producing the data typically have validated the data against their schemas (including compatibility checks) and stored them in a database or a dedicated service (such as schema registry). The schema of each serialized data record is usually retrievable by some metadata attached to that record. In such cases, a Pulsar producer doesn't need to repeat the schema validation when sending the ingested events to a topic. All it needs to do is pass each message or event with its schema to Pulsar. |

Review Comment:
   Makes sense. I've added these links.
