This is an automated email from the ASF dual-hosted git repository.
jin pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-hugegraph-ai.git
The following commit(s) were added to refs/heads/main by this push:
new aa83ec2 fix(llm): align regex extraction of json to json format of prompt (#211)
aa83ec2 is described below
commit aa83ec2a8596ff86ded04e154f32e38252de0574
Author: John <[email protected]>
AuthorDate: Tue Apr 22 11:25:02 2025 +0800
fix(llm): align regex extraction of json to json format of prompt (#211)
See #210
Main change of regex: matching `(\[.*])` -> matching `({.*})`.
tested models:
- qwen-max
- qwen-plus
- deepseek-v3
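A minimal sketch of the regex change described above (the sample reply below is illustrative, not real model output): the extraction prompt asks the LLM for a single JSON object of the form `{"vertices": [...], "edges": [...]}`, so matching `({.*})` captures the whole object, while the old `(\[.*])` latched onto an inner list and produced invalid JSON.

```python
import json
import re

# Illustrative LLM reply: a JSON object wrapped in chatty text
llm_reply = 'Here is the result:\n{"vertices": [], "edges": []}\nHope it helps!'

# Pre-fix pattern: grabs from the first '[' to the last ']' -- an inner fragment
old_match = re.search(r'(\[.*])', llm_reply, re.DOTALL)
# Post-fix pattern: grabs the full JSON object from '{' to '}'
new_match = re.search(r'({.*})', llm_reply, re.DOTALL)

print(old_match.group(1))              # '[], "edges": []' -- not valid JSON
print(json.loads(new_match.group(1)))  # {'vertices': [], 'edges': []}
```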
---------
Co-authored-by: imbajin <[email protected]>
---
.asf.yaml | 1 +
.github/workflows/hugegraph-python-client.yml | 2 +-
hugegraph-llm/README.md | 26 ++++-----
.../operators/llm_op/property_graph_extract.py | 61 ++++++++++++----------
4 files changed, 49 insertions(+), 41 deletions(-)
diff --git a/.asf.yaml b/.asf.yaml
index e92e30b..9fa40a0 100644
--- a/.asf.yaml
+++ b/.asf.yaml
@@ -58,6 +58,7 @@ github:
- HJ-Young
- afterimagex
- returnToInnocence
+ - Thespica
# refer https://cwiki.apache.org/confluence/display/INFRA/Git+-+.asf.yaml+features#Git.asf.yamlfeatures-Notificationsettingsforrepositories
notifications:
diff --git a/.github/workflows/hugegraph-python-client.yml b/.github/workflows/hugegraph-python-client.yml
index e05708a..60c84dd 100644
--- a/.github/workflows/hugegraph-python-client.yml
+++ b/.github/workflows/hugegraph-python-client.yml
@@ -20,7 +20,7 @@ jobs:
- name: Prepare HugeGraph Server Environment
run: |
docker run -d --name=graph -p 8080:8080 -e PASSWORD=admin hugegraph/hugegraph:1.3.0
- sleep 5
+ sleep 10
- uses: actions/checkout@v4
diff --git a/hugegraph-llm/README.md b/hugegraph-llm/README.md
index 1714f08..0251f79 100644
--- a/hugegraph-llm/README.md
+++ b/hugegraph-llm/README.md
@@ -8,12 +8,12 @@ This project includes runnable demos, it can also be used as a third-party library
As we know, graph systems can help large models address challenges like timeliness and hallucination,
while large models can help graph systems with cost-related issues.
-With this project, we aim to reduce the cost of using graph systems, and decrease the complexity of
+With this project, we aim to reduce the cost of using graph systems and decrease the complexity of
building knowledge graphs. This project will offer more applications and integration solutions for
graph systems and large language models.
1. Construct knowledge graph by LLM + HugeGraph
2. Use natural language to operate graph databases (Gremlin/Cypher)
-3. Knowledge graph supplements answer context (GraphRAG -> Graph Agent)
+3. Knowledge graph supplements answer context (GraphRAG → Graph Agent)
## 2. Environment Requirements
> [!IMPORTANT]
@@ -24,7 +24,7 @@ graph systems and large language models.
## 3. Preparation
1. Start the HugeGraph database, you can run it via [Docker](https://hub.docker.com/r/hugegraph/hugegraph)/[Binary Package](https://hugegraph.apache.org/docs/download/download/).
- Refer to detailed [doc](https://hugegraph.apache.org/docs/quickstart/hugegraph-server/#31-use-docker-container-convenient-for-testdev) for more guidance
+ Refer to a detailed [doc](https://hugegraph.apache.org/docs/quickstart/hugegraph-server/#31-use-docker-container-convenient-for-testdev) for more guidance
2. Configuring the poetry environment, Use the official installer to install Poetry, See the [poetry documentation](https://poetry.pythonlang.cn/docs/#installing-with-pipx) for other installation methods
```bash
@@ -32,11 +32,11 @@ graph systems and large language models.
curl -sSL https://install.python-poetry.org | python3 - # install the latest version like 2.0+
```
-2. Clone this project
+3. Clone this project
```bash
git clone https://github.com/apache/incubator-hugegraph-ai.git
```
-3. Install [hugegraph-python-client](../hugegraph-python-client) and [hugegraph_llm](src/hugegraph_llm), poetry officially recommends using virtual environments
+4. Install [hugegraph-python-client](../hugegraph-python-client) and [hugegraph_llm](src/hugegraph_llm), poetry officially recommends using virtual environments
```bash
cd ./incubator-hugegraph-ai/hugegraph-llm
poetry config --list # List/check the current configuration (Optional)
@@ -48,11 +48,11 @@ graph systems and large language models.
poetry shell # use 'exit' to leave the shell
```
If `poetry install` fails or too slow due to network issues, it is recommended to modify `tool.poetry.source` of `hugegraph-llm/pyproject.toml`
-4. Enter the project directory(`./incubator-hugegraph-ai/hugegraph-llm/src`)
+5. Enter the project directory(`./incubator-hugegraph-ai/hugegraph-llm/src`)
```bash
cd ./src
```
-5. Start the gradio interactive demo of **Graph RAG**, you can run with the following command, and open http://127.0.0.1:8001 after starting
+6. Start the gradio interactive demo of **Graph RAG**, you can run with the following command and open http://127.0.0.1:8001 after starting
```bash
python -m hugegraph_llm.demo.rag_demo.app # same as "poetry run xxx"
```
@@ -61,23 +61,23 @@ graph systems and large language models.
python -m hugegraph_llm.demo.rag_demo.app --host 127.0.0.1 --port 18001
```
-6. After running the web demo, the config file `.env` will be automatically generated at the path `hugegraph-llm/.env`. Additionally, a prompt-related configuration file `config_prompt.yaml` will also be generated at the path `hugegraph-llm/src/hugegraph_llm/resources/demo/config_prompt.yaml`.
+7. After running the web demo, the config file `.env` will be automatically generated at the path `hugegraph-llm/.env`. Additionally, a prompt-related configuration file `config_prompt.yaml` will also be generated at the path `hugegraph-llm/src/hugegraph_llm/resources/demo/config_prompt.yaml`.
You can modify the content on the web page, and it will be automatically saved to the configuration file after the corresponding feature is triggered.
You can also modify the file directly without restarting the web application; refresh the page to load your latest changes.
(Optional)To regenerate the config file, you can use `config.generate` with `-u` or `--update`.
```bash
python -m hugegraph_llm.config.generate --update
```
Note: `Litellm` support multi-LLM provider, refer [litellm.ai](https://docs.litellm.ai/docs/providers) to config it
-7. (__Optional__) You could use [hugegraph-hubble](https://hugegraph.apache.org/docs/quickstart/hugegraph-hubble/#21-use-docker-convenient-for-testdev) to visit the graph data, could run it via [Docker/Docker-Compose](https://hub.docker.com/r/hugegraph/hubble)
- for guidance. (Hubble is a graph-analysis dashboard include data loading/schema management/graph traverser/display).
-8. (__Optional__) offline download NLTK stopwords
+8. (__Optional__) You could use [hugegraph-hubble](https://hugegraph.apache.org/docs/quickstart/hugegraph-hubble/#21-use-docker-convenient-for-testdev) to visit the graph data, could run it via [Docker/Docker-Compose](https://hub.docker.com/r/hugegraph/hubble)
+ for guidance. (Hubble is a graph-analysis dashboard that includes data loading/schema management/graph traverser/display).
+9. (__Optional__) offline download NLTK stopwords
```bash
python ./hugegraph_llm/operators/common_op/nltk_helper.py
```
> [!TIP]
-> You can also refer our [quick-start](./quick_start.md) doc to understand how to use it & the basic query logic 🚧
+> You can also refer to our [quick-start](./quick_start.md) doc to understand how to use it & the basic query logic 🚧
## 4 Examples
@@ -124,7 +124,7 @@ This can be obtained from the `LLMs` class.
)
```

-2. **Import Schema**: The `import_schema` method is used to import a schema from a source. The source can be a HugeGraph instance, a user-defined schema or an extraction result. The method `print_result` can be chained to print the result.
+2. **Import Schema**: The `import_schema` method is used to import a schema from a source. The source can be a HugeGraph instance, a user-defined schema, or an extraction result. The method `print_result` can be chained to print the result.
```python
# Import schema from a HugeGraph instance
builder.import_schema(from_hugegraph="xxx").print_result()
diff --git a/hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py b/hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py
index 945fd30..faff1c6 100644
--- a/hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py
+++ b/hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py
@@ -26,7 +26,6 @@ from hugegraph_llm.document.chunk_split import ChunkSplitter
from hugegraph_llm.models.llms.base import BaseLLM
from hugegraph_llm.utils.log import log
-
"""
TODO: It is not clear whether there is any other dependence on the SCHEMA_EXAMPLE_PROMPT variable.
Because the SCHEMA_EXAMPLE_PROMPT variable will no longer change based on
@@ -88,9 +87,9 @@ def filter_item(schema, items) -> List[Dict[str, Any]]:
class PropertyGraphExtract:
def __init__(
- self,
- llm: BaseLLM,
- example_prompt: str = prompt.extract_graph_prompt
+ self,
+ llm: BaseLLM,
+ example_prompt: str = prompt.extract_graph_prompt
) -> None:
self.llm = llm
self.example_prompt = example_prompt
@@ -125,33 +124,41 @@ class PropertyGraphExtract:
return self.llm.generate(prompt=prompt)
def _extract_and_filter_label(self, schema, text) -> List[Dict[str, Any]]:
- # analyze llm generated text to JSON
- json_strings = re.findall(r'(\[.*?])', text, re.DOTALL)
- longest_json = max(json_strings, key=lambda x: len(''.join(x)), default=('', ''))
-
- longest_json_str = ''.join(longest_json).strip()
+ # Use regex to extract a JSON object with curly braces
+ json_match = re.search(r'({.*})', text, re.DOTALL)
+ if not json_match:
+ log.critical("Invalid property graph! No JSON object found, "
+ "please check the output format example in prompt.")
+ return []
+ json_str = json_match.group(1).strip()
items = []
try:
- property_graph = json.loads(longest_json_str)
+ property_graph = json.loads(json_str)
+ # Expect property_graph to be a dict with keys "vertices" and "edges"
+ if not (isinstance(property_graph, dict) and "vertices" in property_graph and "edges" in property_graph):
+ log.critical("Invalid property graph format; expecting 'vertices' and 'edges'.")
+ return items
+
+ # Create sets for valid vertex and edge labels based on the schema
vertex_label_set = {vertex["name"] for vertex in schema["vertexlabels"]}
edge_label_set = {edge["name"] for edge in schema["edgelabels"]}
- for item in property_graph:
- if not isinstance(item, dict):
- log.warning("Invalid property graph item type '%s'.", type(item))
- continue
- if not self.NECESSARY_ITEM_KEYS.issubset(item.keys()):
- log.warning("Invalid item keys '%s'.", item.keys())
- continue
- if item["type"] == "vertex" or item["type"] == "edge":
- if (item["label"] not in vertex_label_set
- and item["label"] not in edge_label_set):
- log.warning("Invalid '%s' label '%s' has been ignored.", item["type"], item["label"])
- else:
- items.append(item)
- else:
- log.warning("Invalid item type '%s' has been ignored.", item["type"])
- except json.JSONDecodeError:
- log.critical("Invalid property graph! Please check the extracted JSON data carefully")
+ def process_items(item_list, valid_labels, item_type):
+ for item in item_list:
+ if not isinstance(item, dict):
+ log.warning("Invalid property graph item type '%s'.", type(item))
+ continue
+ if not self.NECESSARY_ITEM_KEYS.issubset(item.keys()):
+ log.warning("Invalid item keys '%s'.", item.keys())
+ continue
+ if item["label"] not in valid_labels:
+ log.warning("Invalid %s label '%s' has been ignored.", item_type, item["label"])
+ continue
+ items.append(item)
+
+ process_items(property_graph["vertices"], vertex_label_set, "vertex")
+ process_items(property_graph["edges"], edge_label_set, "edge")
+ except json.JSONDecodeError:
+ log.critical("Invalid property graph JSON! Please check the extracted JSON data carefully")
return items
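The new extraction-and-filter path in the diff above can be exercised standalone. The sketch below is illustrative, not the project's API: the function name, the sample schema, and the required-keys set (assumed to match `NECESSARY_ITEM_KEYS`) are all assumptions, and logging is replaced by plain skips.

```python
import json
import re

def extract_and_filter(text, schema):
    """Hypothetical standalone version of the regex-plus-filter logic."""
    match = re.search(r'({.*})', text, re.DOTALL)
    if not match:
        return []
    graph = json.loads(match.group(1))
    if not (isinstance(graph, dict) and "vertices" in graph and "edges" in graph):
        return []
    vertex_labels = {v["name"] for v in schema["vertexlabels"]}
    edge_labels = {e["name"] for e in schema["edgelabels"]}
    required = {"type", "label", "properties"}  # assumed NECESSARY_ITEM_KEYS
    items = []
    for item_list, valid in ((graph["vertices"], vertex_labels),
                             (graph["edges"], edge_labels)):
        for item in item_list:
            # keep only well-formed items whose label exists in the schema
            if isinstance(item, dict) and required.issubset(item) and item["label"] in valid:
                items.append(item)
    return items

schema = {"vertexlabels": [{"name": "person"}], "edgelabels": [{"name": "knows"}]}
reply = ('{"vertices": [{"type": "vertex", "label": "person", "properties": {"name": "Al"}},'
         ' {"type": "vertex", "label": "city", "properties": {}}], "edges": []}')
print(len(extract_and_filter(reply, schema)))  # 1 -- the "city" vertex is filtered out
```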