GitHub user usbrandon added a comment to the discussion: Further LLM Support

What do you guys think the data types should be for handling vector embeddings 
in Hop?
I see there are some slight variations when ultimately querying and retrieving 
from databases.
For example, Snowflake has a very new VECTOR datatype, but it only supports 
FLOAT or INT elements up to 4,096 dimensions. However, they also have ARRAY and 
structured ARRAY types that can handle more.
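To make the Snowflake trade-off concrete, here is a minimal sketch (in Python, with the function name being our own invention) of picking a column type from the embedding width, assuming the limits quoted above: VECTOR takes FLOAT or INT elements up to 4,096 dimensions, with ARRAY as the fallback for wider embeddings.

```python
# Illustrative only: choose a Snowflake column type for an embedding column,
# using the 4,096-dimension VECTOR limit mentioned in the discussion.
def snowflake_type_for(dimension: int, element_type: str = "FLOAT") -> str:
    if element_type not in ("FLOAT", "INT"):
        raise ValueError("Snowflake VECTOR only supports FLOAT or INT elements")
    if dimension <= 4096:
        return f"VECTOR({element_type}, {dimension})"
    # Fall back to a plain ARRAY for wider embeddings.
    return "ARRAY"

print(snowflake_type_for(1536))  # -> VECTOR(FLOAT, 1536)
print(snowflake_type_for(8192))  # -> ARRAY
```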
Then, as another example, there is pgvector, which adds a vector type to 
Postgres. Timescale has a writeup on using it here: 
https://www.timescale.com/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector
and the extension itself is here: 
https://github.com/pgvector/pgvector

That extension also defines some of its own datatypes:
vector (2,000 dimension limit)
halfvec (4,000 dimension limit)
bit (64,000 dimension limit)
etc.
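Whatever the target type, pgvector sends and returns values as a bracketed, comma-separated text literal like [1,2,3]. A minimal round-trip sketch (function names are our own, not part of any Hop or pgvector API):

```python
# Illustrative only: serialize a list of floats to pgvector's text format
# and parse it back, as a database transform would need to do.
def to_pgvector_literal(values):
    return "[" + ",".join(repr(float(v)) for v in values) + "]"

def from_pgvector_literal(text):
    inner = text.strip().lstrip("[").rstrip("]")
    return [float(part) for part in inner.split(",")] if inner else []

vec = [0.25, -1.5, 3.0]
literal = to_pgvector_literal(vec)       # "[0.25,-1.5,3.0]"
assert from_pgvector_literal(literal) == vec
```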

Just to be clear, I see these as two separate issues: Hop's internal storage, 
and external database query/serialization. But I think it is useful to 
consider the inputs and outputs to anticipate the impact of the choice. 
Serializing them in and out of XML and a database will be handled by various 
steps.

Test ETLs we'd write would have Data Grids in them, so we need to represent 
vectors there easily. People would probably also bring in JSON versions of 
vectors, so we need good serialization to and from the source type.
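On the JSON side, the good news is that standard JSON round-trips doubles exactly, so embeddings pasted in as JSON arrays lose no precision. A quick sketch (the "embedding" field name is just an example, not a Hop convention):

```python
import json

# Illustrative only: round-trip an embedding through JSON, as a
# JSON Input / JSON Output pair would.
record = {"embedding": [0.123456789012345, -2.5e-07, 1.0]}
text = json.dumps(record)
restored = json.loads(text)
assert restored["embedding"] == record["embedding"]  # exact round trip
```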

As a user, I think I would lean towards ease of use for the input / output 
(copy/paste).
We can consider upgrades to the JSON Input / JSON Output steps.
Finally, there is the serialization / deserialization to our databases of 
choice.

If we can settle on the data types to use for Hop itself, every other 
question is a matter of conversion.
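To illustrate that last point: if Hop held embeddings as one canonical type (say, a plain list of floats), each output is just a renderer. A sketch, where the target names and formats are illustrative, not proposals for an actual Hop API:

```python
import json

# Illustrative only: one internal representation, converted at the edges.
def render(values, target):
    if target == "pgvector":         # pgvector text literal
        return "[" + ",".join(repr(float(v)) for v in values) + "]"
    if target == "json":             # JSON array for JSON Output
        return json.dumps(values)
    if target == "snowflake_array":  # ARRAY_CONSTRUCT(...) SQL expression
        return "ARRAY_CONSTRUCT(" + ",".join(repr(float(v)) for v in values) + ")"
    raise ValueError(f"unknown target: {target}")

vec = [1.0, -0.5]
print(render(vec, "pgvector"))  # -> [1.0,-0.5]
print(render(vec, "json"))      # -> [1.0, -0.5]
```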

GitHub link: 
https://github.com/apache/hop/discussions/4732#discussioncomment-11739770

----
This is an automatically sent email for [email protected].