GitHub user ep9io edited a comment on the discussion: Further LLM Support
I can share the approach I was taking using LangChain4J's embeddings and search, and see if it makes sense or helps point to a direction.
The overall idea is to split the tasks into independent plugins so that they can be reused more broadly and leverage other plugins within the HOP ecosystem.
- Create embeddings
- Persist embeddings
- Retrieve embeddings
- Compare/search embeddings
## Plugin 1: Create the embeddings. In other words, _transform_ text into
number arrays.
Inputs: text field and model details
Outputs: array of numbers
This plugin selects an input field as the input value (e.g. a text value) and passes it through the model (typically an encoder like BERT). What you get back is a dense vector of numbers (an embedding). These might also be called feature vectors, word vectors, latent representations, and to some extent logits. They can look like this:
`[0.12, -0.34, 0.56, 0.78, -0.91]` This could represent a word in a
5-dimensional embedding space.
Your plugin already does this in the line highlighted below:
```java
private void addToVector(String textValue, Object keyValue, Object[] lookupRowValues) {
    ...
    TextSegment segment = TextSegment.from(textValue);
>>> Embedding embedding = data.embeddingModel.embed(segment).content(); <<<
    String keyString = String.valueOf(keyValue);
    data.embeddingStore.add(keyString, embedding);
    ...
}
```
These numbers are quite useful on their own, and the output of this plugin is just these arrays, nothing else. Think of it as a pre-processing step.
Attaching a Parquet/CSV/JSON output transform after this plugin has many applications. This plugin alone can be used to create a dataset that can be used or reused further down the workflow. In other words, once you create the embeddings, you don't need to recreate them every time you run a task. The pipelines or external tasks further down the workflow could simply read a Parquet file that contains these embeddings.
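As a minimal sketch of that idea (plain Java, no HOP or LangChain4J APIs; the CSV layout and names are made up for illustration), persisting embeddings so downstream tasks can reload them could be as simple as writing one key plus its vector per line:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.StringJoiner;

public class EmbeddingCsvWriter {

    // Serialize one key + embedding as a CSV line: key,v0,v1,...
    static String toCsvLine(String key, float[] embedding) {
        StringJoiner joiner = new StringJoiner(",");
        joiner.add(key);
        for (float v : embedding) {
            joiner.add(Float.toString(v));
        }
        return joiner.toString();
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("embeddings", ".csv");
        // Toy 5-dimensional embedding like the example above
        float[] embedding = {0.12f, -0.34f, 0.56f, 0.78f, -0.91f};
        Files.write(out, List.of(toCsvLine("doc-1", embedding)));
        System.out.println(Files.readAllLines(out).get(0));
    }
}
```

In practice HOP's existing Parquet/CSV output transforms would do this for you; the point is just that the embedding dataset becomes an ordinary file artifact.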
## Plugin 2 (concept): Store the embeddings
Inputs: array of numbers
Output: Not sure, perhaps saving results or the IDs of the entries.
Your plugin implements this concept:
```java
private void addToVector(String textValue, Object keyValue, Object[] lookupRowValues) {
    ...
    TextSegment segment = TextSegment.from(textValue);
    Embedding embedding = data.embeddingModel.embed(segment).content();
    String keyString = String.valueOf(keyValue);
>>> data.embeddingStore.add(keyString, embedding); <<<
    ...
}
```
Saving to a text file or Parquet is one option that's already available out of the box in HOP, and is handy for backups, troubleshooting, or further analysis.
A more practical method, though, is to store the embeddings in a database designed for vectors, as your plugin already does.
From what I can see, your plugin uses in-memory and Neo4j stores. The plugin I was working on was going to save to Milvus.
This plugin would be similar to existing output HOP transforms in that it
receives inputs (vectors) and persists them somewhere designed for vectors.
The in-memory option probably won't make much sense here, but it can in plugin
4.
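For illustration only (plain Java mirroring the `embeddingStore.add(keyString, embedding)` call quoted above, not LangChain4J's actual store interface), the core of the in-memory variant of such a store is essentially a keyed map:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class InMemoryEmbeddingStore {

    // Keyed store: embedding id -> vector
    private final Map<String, float[]> store = new LinkedHashMap<>();

    public void add(String key, float[] embedding) {
        store.put(key, embedding);
    }

    public float[] get(String key) {
        return store.get(key);
    }

    public int size() {
        return store.size();
    }

    public static void main(String[] args) {
        InMemoryEmbeddingStore store = new InMemoryEmbeddingStore();
        store.add("doc-1", new float[]{0.12f, -0.34f, 0.56f});
        System.out.println(store.size());
    }
}
```

A real persistence plugin would replace the map with calls to Milvus, Neo4j, or another vector database, but the input contract (key plus vector in, acknowledgement/ID out) stays the same.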
## Plugin 3 (general concept): The inverse of plugin 2, the input transform.
Inputs: file location or address, depending on the database type.
Outputs: array of numbers.
Similar to HOP's Text File, JSON, Parquet, and database input transforms, it reads from a source and outputs the values from that source. I haven't used the Neo4j option, so I'm not sure if it can be reused, but for other sources such as Milvus (or whatever vector database), something would need to be created. Anything that isn't available out of the box in HOP (e.g. Parquet, text, etc.) would fall under this plugin.
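As a hedged sketch of the read direction (plain Java, parsing the one-line-per-embedding CSV layout I assumed earlier rather than any real HOP transform or vector-database API):

```java
import java.util.AbstractMap;
import java.util.Map;

public class EmbeddingCsvReader {

    // Parse a CSV line "key,v0,v1,..." back into a key + vector pair
    static Map.Entry<String, float[]> parseLine(String line) {
        String[] parts = line.split(",");
        float[] embedding = new float[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            embedding[i - 1] = Float.parseFloat(parts[i]);
        }
        return new AbstractMap.SimpleEntry<>(parts[0], embedding);
    }

    public static void main(String[] args) {
        Map.Entry<String, float[]> row = parseLine("doc-1,0.12,-0.34,0.56");
        System.out.println(row.getKey() + " -> " + row.getValue().length + " dims");
    }
}
```

The same shape applies to a database-backed reader: it emits rows of (key, vector) for the rest of the pipeline to consume.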
## Plugin 4: Search/Similarity/etc.
Inputs: embedding IDs, array of numbers, model details, and perhaps search parameters.
Outputs: Generated records containing the relevant matches along with their
scores.
This is pretty much what your code does:
```java
private List<Object[]> getFromVector(Object[] keyRow) throws HopValueException {
    ...
    String lookupValueString = getInputRowMeta().getString(keyRow, data.indexOfMainField);
    Embedding queryEmbedding = data.embeddingModel.embed(lookupValueString).content();
    EmbeddingSearchRequest esr = new EmbeddingSearchRequest(queryEmbedding, data.maxResults, null, null);
    EmbeddingSearchResult<TextSegment> relevant = data.embeddingStore.search(esr);
    List<Object[]> retList = new ArrayList<Object[]>(data.maxResults);
    for (EmbeddingMatch<TextSegment> embeddingMatch : relevant.matches()) {
        String key = embeddingMatch.embeddingId();
        Double score = embeddingMatch.score();
        Object[] matchedRow = data.look.get(key);
        Object resultKey = matchedRow[SemanticSearchData.KEY_INDEX];
        String value = (String) matchedRow[SemanticSearchData.VALUE_INDEX];
    ...
}
```
As for cosine similarity and t-SNE, as @usbrandon mentioned, I would leave those out. Linear methods such as PCA and non-linear ones like t-SNE are other forms of (pre/post) processing steps (e.g. analysis, dimensionality reduction) and are perhaps best left for another HOP plugin or an external process such as Python to handle. For example, they can be run via Python against the embeddings dataset produced by plugin 1. Cosine similarity is one of the most commonly used similarity functions, in that it measures the direction (angle) of the vectors. However, there are other similarity functions (euclidean distance, dot product, etc.), and I would leave those up to the vector database or external tool to handle. They'll do a better job than something implemented in HOP; for example, they might use the GPU for better performance (e.g. Meta's Faiss). However, plugin 4 could accept extra parameters that are passed through to the database/tool (e.g. to use a different function).
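For reference, the similarity functions mentioned above are each only a few lines; a self-contained sketch in plain Java (no library, just to make the definitions concrete):

```java
public class VectorMath {

    // Dot product: sum of element-wise products
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    // Cosine similarity: dot product normalized by the vectors' lengths,
    // i.e. it compares direction and ignores magnitude
    static double cosine(double[] a, double[] b) {
        return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    // Euclidean distance: straight-line distance between the points
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] x = {0.12, -0.34, 0.56};
        double[] y = {0.10, -0.30, 0.60};
        System.out.printf("cosine=%.4f euclidean=%.4f%n", cosine(x, y), euclidean(x, y));
    }
}
```

A vector database implements the same math, but with indexing (and sometimes GPU acceleration) so it doesn't have to compare the query against every stored vector.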
Regarding in-memory, it can be tricky to scale if the data is large, but it still serves a useful purpose. Your plugin already has it, and there are other forms as well. I was trying to fit Meta's Faiss into HOP, but abandoned that and instead used it as an in-memory similarity search (using euclidean distance) via Python. Plugin 4 could allow reading one input stream into memory and then searching it using the values from another input stream. HOP's Stream Lookup transform does something like this, in that it reads one stream into memory and then uses it to process the values (lookups) of another stream.
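A hedged sketch of that stream-lookup-style idea in plain Java (load one stream of keyed vectors into memory, then score each query from the other stream against all of them; euclidean distance as in the Faiss example, and all names are made up):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InMemorySimilaritySearch {

    private final Map<String, double[]> index = new LinkedHashMap<>();

    // First stream: load keyed embeddings into memory
    public void add(String key, double[] embedding) {
        index.put(key, embedding);
    }

    // Second stream: for each query vector, return the k nearest keys
    public List<String> nearest(double[] query, int k) {
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, double[]> e : index.entrySet()) {
            scored.add(Map.entry(e.getKey(), euclidean(query, e.getValue())));
        }
        // Smaller distance = closer match
        scored.sort(Comparator.comparingDouble(Map.Entry::getValue));
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < Math.min(k, scored.size()); i++) {
            keys.add(scored.get(i).getKey());
        }
        return keys;
    }

    private static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        InMemorySimilaritySearch search = new InMemorySimilaritySearch();
        search.add("a", new double[]{0.0, 0.0});
        search.add("b", new double[]{1.0, 1.0});
        search.add("c", new double[]{5.0, 5.0});
        System.out.println(search.nearest(new double[]{0.9, 1.1}, 2));
    }
}
```

This brute-force scan is exactly what won't scale to large indexes, which is why handing the search off to Faiss or a vector database makes sense once the data grows.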
GitHub link:
https://github.com/apache/hop/discussions/4732#discussioncomment-11718595