GitHub user usbrandon added a comment to the discussion: Further LLM Support
What do you all think the data types should be for handling vector embeddings in Hop? There are some slight variations in how databases ultimately store, query, and retrieve them. For example, Snowflake has a very new VECTOR data type, but it only supports FLOAT and INT elements and up to 4,096 dimensions; however, Snowflake also has ARRAY and structured array types that can handle larger values. As another example, there is pgvector, an extension for Postgres (https://github.com/pgvector/pgvector); Timescale has a walkthrough of using it to store and query OpenAI embeddings: https://www.timescale.com/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector. That extension also defines some data types of its own:

- vector (2,000 dimension limit)
- halfvec (4,000 dimension limit)
- bit (64,000 dimension limit)
- etc.

Just to be clear, I do see these as two separate issues: Hop's internal storage, and external database query/serialization. But I think it is useful to consider the inputs and outputs to anticipate the impact of the choice. Serializing vectors in and out of XML and a database will be handled by various steps. Test ETL we'd write would have a Data Grid in it, so we need to represent vectors there easily. People will probably bring over JSON versions of vectors, so we need good serialization to and from the source type. As a user, I think I would lean towards ease of use for input/output (copy/paste). We can consider upgrades to the JSON Input / JSON Output steps. Finally, there is serialization/deserialization to our databases of choice. If we can settle on the data types to use for Hop itself, every other question becomes a matter of conversion.

GitHub link: https://github.com/apache/hop/discussions/4732#discussioncomment-11739770
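To illustrate the kind of conversion involved, here is a minimal sketch, assuming Hop held an embedding internally as a plain list of doubles (e.g. parsed from JSON) and serialized it to pgvector's textual `vector` input format (`'[0.1,0.2,...]'`). The function names are hypothetical, not part of Hop or pgvector:

```python
import json

def to_pgvector_literal(values):
    """Serialize a list of floats to pgvector's text input format, e.g. '[0.1,-0.25,0.5]'.
    The resulting string can be cast in SQL with '...'::vector."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"

def from_pgvector_literal(text):
    """Parse pgvector's text output format back into a list of floats."""
    return [float(v) for v in text.strip().strip("[]").split(",")]

# A JSON array of floats, as a user might paste into a Data Grid or feed to JSON Input.
embedding = json.loads("[0.1, -0.25, 0.5]")
literal = to_pgvector_literal(embedding)
assert from_pgvector_literal(literal) == embedding  # round-trips losslessly
```

If Hop settled on "array of doubles" (or similar) as the internal representation, each database dialect would only need a pair of conversions like these, plus any dimension-limit checks the target type imposes.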
