GitHub user usbrandon added a comment to the discussion: Further LLM Support
What do you all think the data types should be for handling vector embeddings in Hop? There are some slight variations in how databases ultimately store, query, and retrieve them. For example, Snowflake has a very new VECTOR data type, but it only supports FLOAT and INT elements and up to 4,096 dimensions; however, Snowflake also has ARRAY and structured array types that can handle larger values. As another example, there is pgvector, an extension for Postgres (https://github.com/pgvector/pgvector); Timescale has a walkthrough of using it to store and query OpenAI embeddings: https://www.timescale.com/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector. That extension also defines some data types of its own:

- vector (2,000 dimension limit)
- halfvec (4,000 dimension limit)
- bit (64,000 dimension limit)
- etc.

Just to be clear, I do see these as two separate issues: Hop's internal storage, and external database query/serialization. But I think it is useful to consider the inputs and outputs to anticipate the impact of the choice. Serializing vectors in and out of XML and a database will be handled by various steps. Test ETL we'd write would have a Data Grid in it, so we need to represent vectors there easily. People will probably bring over JSON versions of vectors, so we need good serialization to and from the source type. As a user, I think I would lean towards ease of use for input/output (copy/paste). We can consider upgrades to the JSON Input / JSON Output steps. Finally, there is serialization/deserialization to our databases of choice. If we can settle on the data types to use for Hop itself, every other question becomes a matter of conversion.

GitHub link: https://github.com/apache/hop/discussions/4732#discussioncomment-11739770
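To illustrate the kind of conversion involved, here is a minimal sketch, assuming Hop held an embedding internally as a plain list of doubles (e.g. parsed from JSON) and serialized it to pgvector's textual `vector` input format (`'[0.1,0.2,...]'`). The function names are hypothetical, not part of Hop or pgvector:

```python
import json

def to_pgvector_literal(values):
    """Serialize a list of floats to pgvector's text input format, e.g. '[0.1,-0.25,0.5]'.
    The resulting string can be cast in SQL with '...'::vector."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"

def from_pgvector_literal(text):
    """Parse pgvector's text output format back into a list of floats."""
    return [float(v) for v in text.strip().strip("[]").split(",")]

# A JSON array of floats, as a user might paste into a Data Grid or feed to JSON Input.
embedding = json.loads("[0.1, -0.25, 0.5]")
literal = to_pgvector_literal(embedding)
assert from_pgvector_literal(literal) == embedding  # round-trips losslessly
```

If Hop settled on "array of doubles" (or similar) as the internal representation, each database dialect would only need a pair of conversions like these, plus any dimension-limit checks the target type imposes.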
