This is an automated email from the ASF dual-hosted git repository.

jin pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-hugegraph-ai.git


The following commit(s) were added to refs/heads/main by this push:
     new b51717c  feat(llm): support semi-automated generated graph schema (#274)
b51717c is described below

commit b51717c214bd53f2bcc50594318698e7e6e1e40c
Author: Gearless <[email protected]>
AuthorDate: Mon Jul 7 21:15:46 2025 +0800

    feat(llm): support semi-automated generated graph schema (#274)
    
    ## Overview
    This PR implements a semi-automated graph schema generation feature. The
    system collects raw data samples provided by the user, along with
    user-provided or built-in default Query Examples and Few-Shot templates,
    and sends them to the LLM to generate an initial schema draft for user
    reference. After the user reviews and adjusts the draft in the UI, the
    final Graph Schema is applied to the HugeGraph instance.
    
    ## Main Changes
    1. **New Operator: `schema_build.py`**
    - Adds a `SchemaBuilder` operator responsible for prompt construction,
    invoking the LLM, and parsing the returned schema JSON.
    2. **Built-in Prompt Configuration: `prompt_config.py`**
    - Preloads default Query Examples and Few-Shot schema templates in
    `prompt_config.py`; users can directly use these default templates for
    schema generation.
    3. **UI & Workflow Updates**
    - Updates files such as `vector_graph_block.py` and
    `kg_construction_task.py` to add a collapsible "Graph Schema Generator"
    section in the "Build RAG Index" module for triggering the
    semi-automated graph schema generation.
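The response-parsing step of the new operator (stripping an optional Markdown code fence before `json.loads`, as `SchemaBuilder._extract_schema` does in the diff below) boils down to:

```python
import json
import re


def extract_schema(response: str) -> dict:
    """Pull the schema JSON out of an LLM reply, stripping an optional
    Markdown code fence first (same regex as SchemaBuilder)."""
    match = re.search(r"```(?:json)?\s*(.*?)```", response, re.DOTALL)
    if match:
        response = match.group(1).strip()
    return json.loads(response)  # raises on malformed LLM output
```

In the operator, a `json.JSONDecodeError` here is logged and re-raised as a `RuntimeError` so the UI can surface a clear failure.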
    
    ---------
    
    Co-authored-by: imbajin <[email protected]>
---
 .../src/hugegraph_llm/config/prompt_config.py      |  52 ++++----
 .../demo/rag_demo/vector_graph_block.py            |  64 +++++++++-
 .../operators/kg_construction_task.py              |   5 +
 .../hugegraph_llm/operators/llm_op/schema_build.py | 137 +++++++++++++++++++++
 .../resources/prompt_examples/query_examples.json  |   9 ++
 .../resources/prompt_examples/schema_examples.json |  47 +++++++
 .../src/hugegraph_llm/utils/graph_index_utils.py   |  41 ++++++
 7 files changed, 326 insertions(+), 29 deletions(-)

diff --git a/hugegraph-llm/src/hugegraph_llm/config/prompt_config.py b/hugegraph-llm/src/hugegraph_llm/config/prompt_config.py
index 547cda9..abc1c18 100644
--- a/hugegraph-llm/src/hugegraph_llm/config/prompt_config.py
+++ b/hugegraph-llm/src/hugegraph_llm/config/prompt_config.py
@@ -274,25 +274,25 @@ and experiences.
 - 边:[边标签、源顶点标签、目标顶点标签及其属性列表]
 
 ### 内容规则
-请仔细阅读提供的文本,识别与模式中定义的顶点和边相对应的信息。对于每一条匹配顶点或边的信息,按以下JSON结构格式化:
+请仔细阅读提供的文本,识别与模式中定义的顶点和边相对应的信息。对于每一条匹配顶点或边的信息,按以下 JSON 结构格式化:
 
 #### 顶点格式:
-{"id":"顶点标签ID:实体名称","label":"顶点标签","type":"vertex","properties":{"属性名":"属性值", ...}}
+{"id":"顶点标签 ID:实体名称","label":"顶点标签","type":"vertex","properties":{"属性名":"属性值", ...}}
 
 #### 边格式:
-{"label":"边标签","type":"edge","outV":"源顶点ID","outVLabel":"源顶点标签","inV":"目标顶点ID","inVLabel":"目标顶点标签","properties":{"属性名":"属性值",...}}
+{"label":"边标签","type":"edge","outV":"源顶点 ID","outVLabel":"源顶点标签","inV":"目标顶点 ID","inVLabel":"目标顶点标签","properties":{"属性名":"属性值",...}}
 
 同时遵循以下规则:
 1. 不要提取给定模式中不存在的属性字段或标签
 2. 确保提取的属性集与给定模式类型一致(如'age'应为数字,'select'应为布尔值)
-3. 如果有多个主键,生成VID的策略是:顶点标签ID:pk1!pk2!pk3(pk表示主键,'!'是分隔符)
-4. 以JSON格式输出,仅包含顶点和边,移除空属性,基于文本/规则和模式提取和格式化
+3. 如果有多个主键,生成 VID 的策略是:顶点标签 ID:pk1!pk2!pk3(pk 表示主键,'!'是分隔符)
+4. 以 JSON 格式输出,仅包含顶点和边,移除空属性,基于文本/规则和模式提取和格式化
 5. 如果给定文本为中文但模式为英文,则将模式字段翻译成中文(可选)
 
 ## 示例
 ### 输入示例:
 #### 文本
-认识Sarah,一位30岁的律师,和她的室友James,他们从2010年开始合住。James在职业生活中是一名记者。
+认识 Sarah,一位 30 岁的律师,和她的室友 James,他们从 2010 年开始合住。James 在职业生活中是一名记者。
 
 #### 图谱模式
 
{"vertices":[{"vertex_label":"person","properties":["name","age","occupation"]}],
 "edges":[{"edge_label":"roommate", "source_vertex_label":"person","target_vertex_label":"person","properties":["date"]]}
@@ -302,7 +302,7 @@ and experiences.
 """
 
     gremlin_generate_prompt_CN: str = """
-你是图查询语言(Gremlin)的专家。你的角色是理解图谱的模式,识别用户查询背后的意图,并根据给定的指令生成准确的Gremlin代码。
+你是图查询语言(Gremlin)的专家。你的角色是理解图谱的模式,识别用户查询背后的意图,并根据给定的指令生成准确的 Gremlin 代码。
 
 ### 任务
 ## 复杂查询检测:
@@ -320,19 +320,19 @@ and experiences.
 # 规则
 - **复杂查询处理**:
     - **检测**:如果用户的查询符合上述任一复杂性标准,则视为**复杂**查询。
-    - **响应**:对于复杂查询,**不要**进行Gremlin查询生成。相反,直接返回以下Gremlin查询:
+    - **响应**:对于复杂查询,**不要**进行 Gremlin 查询生成。相反,直接返回以下 Gremlin 查询:
     ```gremlin
     g.V().limit(0)
     ```
 - **简单查询处理**:
     - 如果查询**不**符合任何复杂性标准,则视为**简单**查询。
-    - 按照下面的说明进行Gremlin查询生成任务。
+    - 按照下面的说明进行 Gremlin 查询生成任务。
 
-## Gremlin查询生成(仅在查询不复杂时执行):
+## Gremlin 查询生成(仅在查询不复杂时执行):
 # 规则
-- 如果在上下文中提供了顶点ID,可以直接使用。
-- 如果提供的问题包含与顶点ID非常相似的实体名称,则在生成的Gremlin语句中替换原始问题中的近似实体。
-例如,如果问题包含名称ABC,而提供的顶点ID不包含ABC而只有abC,则在生成gremlin时使用abC而不是原始问题中的ABC。
+- 如果在上下文中提供了顶点 ID,可以直接使用。
+- 如果提供的问题包含与顶点 ID 非常相似的实体名称,则在生成的 Gremlin 语句中替换原始问题中的近似实体。
+例如,如果问题包含名称 ABC,而提供的顶点 ID 不包含 ABC 而只有 abC,则在生成 gremlin 时使用 abC 而不是原始问题中的 ABC。
 
 输出格式必须如下:
 ```gremlin
@@ -340,18 +340,18 @@ g.V().limit(10)
 ```
 图谱模式:
 {schema}
-参考Gremlin示例对:
+参考 Gremlin 示例对:
 {example}
 
-与查询相关的已提取顶点ID:
+与查询相关的已提取顶点 ID:
 {vertices}
 
-从以下用户查询生成Gremlin:
+从以下用户查询生成 Gremlin:
 {query}
 
 **重要提示:请勿输出任何分析、推理步骤、解释或其他文本。仅返回用 ```gremlin``` 标记包装的 Gremlin 查询。**
 
-生成的Gremlin是:
+生成的 Gremlin 是:
 """
 
     keywords_extract_prompt_CN: str = """指令:
@@ -368,24 +368,24 @@ g.V().limit(10)
 要求:
 - 关键词应为有意义且具体的实体,避免使用无意义或过于宽泛的词语,或单字符的词(例如:“物品”、“动作”、“效果”、“作用”、“的”、“他”)。
 - 优先提取主语、动词和宾语,避免提取虚词或助词。
-- 保持语义完整性: 抽取的关键词应尽量保持关键词在原语境中语义和信息的完整性(例如:“苹果电脑”应作为一个整体被抽取,而不是被分为“苹果”和“电脑”)。
-- 避免泛化: 不要扩展为不相关的泛化类别。
+- 保持语义完整性:抽取的关键词应尽量保持关键词在原语境中语义和信息的完整性(例如:“苹果电脑”应作为一个整体被抽取,而不是被分为“苹果”和“电脑”)。
+- 避免泛化:不要扩展为不相关的泛化类别。
 注意:
-- 仅考虑语境相关的同义词: 只需考虑给定语境下的关键词的语义近义词和具有类似含义的其他词语。
-- 调整关键词长度: 如果关键词相对宽泛,可以根据语境适当增加单个关键词的长度(例如:“违法行为”可以作为一个单独的关键词被抽取,或抽取为“违法”,但不应拆分为“违法”和“行为”)。
+- 仅考虑语境相关的同义词:只需考虑给定语境下的关键词的语义近义词和具有类似含义的其他词语。
+- 调整关键词长度:如果关键词相对宽泛,可以根据语境适当增加单个关键词的长度(例如:“违法行为”可以作为一个单独的关键词被抽取,或抽取为“违法”,但不应拆分为“违法”和“行为”)。
 输出格式:
-- 仅输出一行内容, 以 KEYWORDS: 为前缀,后跟所有关键词或对应的同义词,之间用逗号分隔。抽取的关键词中不允许出现空格或空字符
+- 仅输出一行内容,以 KEYWORDS: 为前缀,后跟所有关键词或对应的同义词,之间用逗号分隔。抽取的关键词中不允许出现空格或空字符
 - 格式示例:
-KEYWORDS:关键词1,关键词2,...,关键词n
+KEYWORDS:关键词 1,关键词 2,...,关键词 n
 
 MAX_KEYWORDS: {max_keywords}
 文本:
 {question}
 """
 
-    doc_input_text_CN: str = """介绍一下Sarah,她是一位30岁的律师,还有她的室友James,他们从2010年开始一起合租。James是一名记者,
-职业道路也很出色。另外,Sarah拥有一个个人网站www.sarahsplace.com,而James也经营着自己的网页,不过这里没有提到具体的网址。这两个人,
-Sarah和James,不仅建立起了深厚的室友情谊,还各自在网络上开辟了自己的一片天地,展示着他们各自丰富多彩的兴趣和经历。
+    doc_input_text_CN: str = """介绍一下 Sarah,她是一位 30 岁的律师,还有她的室友 James,他们从 2010 年开始一起合租。James 是一名记者,
+职业道路也很出色。另外,Sarah 拥有一个个人网站 www.sarahsplace.com,而 James 也经营着自己的网页,不过这里没有提到具体的网址。这两个人,
+Sarah 和 James,不仅建立起了深厚的室友情谊,还各自在网络上开辟了自己的一片天地,展示着他们各自丰富多彩的兴趣和经历。
 """
 
     generate_extract_prompt_template: str = """## Your Role
diff --git a/hugegraph-llm/src/hugegraph_llm/demo/rag_demo/vector_graph_block.py b/hugegraph-llm/src/hugegraph_llm/demo/rag_demo/vector_graph_block.py
index c353303..1d5e8a9 100644
--- a/hugegraph-llm/src/hugegraph_llm/demo/rag_demo/vector_graph_block.py
+++ b/hugegraph-llm/src/hugegraph_llm/demo/rag_demo/vector_graph_block.py
@@ -34,7 +34,7 @@ from hugegraph_llm.utils.graph_index_utils import (
     clean_all_graph_data,
     update_vid_embedding,
     extract_graph,
-    import_graph_data,
+    import_graph_data, build_schema,
 )
 from hugegraph_llm.utils.hugegraph_utils import check_graph_db_connection
 from hugegraph_llm.utils.log import log
@@ -84,6 +84,25 @@ def load_example_names():
     except (FileNotFoundError, json.JSONDecodeError):
         return ["No available examples"]
 
+def load_query_examples():
+    """Load query examples from JSON file"""
+    try:
+        examples_path = os.path.join(resource_path, "prompt_examples", "query_examples.json")
+        with open(examples_path, 'r', encoding='utf-8') as f:
+            examples = json.load(f)
+        return json.dumps(examples, indent=2)
+    except (FileNotFoundError, json.JSONDecodeError):
+        return "[]"
+
+def load_schema_fewshot_examples():
+    """Load few-shot examples from a JSON file"""
+    try:
+        examples_path = os.path.join(resource_path, "prompt_examples", "schema_examples.json")
+        with open(examples_path, 'r', encoding='utf-8') as f:
+            examples = json.load(f)
+        return json.dumps(examples, indent=2)
+    except (FileNotFoundError, json.JSONDecodeError):
+        return "[]"
 
 def update_example_preview(example_name):
     """Update the display content based on the selected example name."""
@@ -103,9 +122,8 @@ def update_example_preview(example_name):
         log.warning("Could not update example preview: %s", e)
     return "", "", ""
 
-
 def _create_prompt_helper_block(demo, input_text, info_extract_template):
-    with gr.Accordion("Assist in generating graph extraction prompts", open=True):
+    with gr.Accordion("Graph Extraction Prompt Generator", open=False):
         gr.Markdown(
             "Provide your **original text** and **expected scenario**, "
             "then select a reference example to generate a high-quality graph extraction prompt."
@@ -154,6 +172,13 @@ def _create_prompt_helper_block(demo, input_text, info_extract_template):
         )
 
 
+def _build_schema_and_provide_feedback(input_text, query_example, few_shot):
+    gr.Info("Generating schema, please wait...")
+    # Call the original build_schema function
+    generated_schema = build_schema(input_text, query_example, few_shot)
+    gr.Info("Schema generated successfully!")
+    return generated_schema
+
 def create_vector_graph_block():
     # pylint: disable=no-member
     # pylint: disable=C0301
@@ -213,8 +238,31 @@ def create_vector_graph_block():
             graph_extract_bt = gr.Button("Extract Graph Data (1)", variant="primary")
             graph_loading_bt = gr.Button("Load into GraphDB (2)", interactive=True)
             graph_index_rebuild_bt = gr.Button("Update Vid Embedding")
+
         gr.Markdown("---")
+        with gr.Accordion("Graph Schema Generator", open=False):
+            gr.Markdown(
+                "Provide **query examples** and **few-shot examples**, "
+                "then click **Generate Schema** to automatically create graph schema."
+            )
+            with gr.Row():
+                query_example = gr.Code(
+                    value=load_query_examples(),
+                    label="Query Examples",
+                    language="json",
+                    lines=10,
+                    max_lines=15
+                )
+                few_shot = gr.Code(
+                    value=load_schema_fewshot_examples(),
+                    label="Few-shot Example",
+                    language="json",
+                    lines=10,
+                    max_lines=15
+                )
+                build_schema_bt = gr.Button("Generate Schema", variant="primary")
         _create_prompt_helper_block(demo, input_text, info_extract_template)
+
         vector_index_btn0.click(get_vector_index_info, outputs=out).then(
             store_prompt,
             inputs=[input_text, input_schema, info_extract_template],
@@ -255,6 +303,16 @@ def create_vector_graph_block():
             inputs=[input_text, input_schema, info_extract_template],
         )
 
+        # TODO: we should store the examples after the user changed them.
+        build_schema_bt.click(
+            _build_schema_and_provide_feedback,
+            inputs=[input_text, query_example, few_shot],
+            outputs=[input_schema]
+        ).then(
+            store_prompt,
+            inputs=[input_text, input_schema, info_extract_template],  # TODO: Store the updated examples
+        )
+
         def on_tab_select(input_f, input_t, evt: gr.SelectData):
             print(f"You selected {evt.value} at {evt.index} from {evt.target}")
             if evt.value == "file":
diff --git a/hugegraph-llm/src/hugegraph_llm/operators/kg_construction_task.py b/hugegraph-llm/src/hugegraph_llm/operators/kg_construction_task.py
index a736751..4348477 100644
--- a/hugegraph-llm/src/hugegraph_llm/operators/kg_construction_task.py
+++ b/hugegraph-llm/src/hugegraph_llm/operators/kg_construction_task.py
@@ -31,6 +31,7 @@ from hugegraph_llm.operators.index_op.build_vector_index import BuildVectorIndex
 from hugegraph_llm.operators.llm_op.disambiguate_data import DisambiguateData
 from hugegraph_llm.operators.llm_op.info_extract import InfoExtract
 from hugegraph_llm.operators.llm_op.property_graph_extract import PropertyGraphExtract
+from hugegraph_llm.operators.llm_op.schema_build import SchemaBuilder
 from hugegraph_llm.utils.decorators import log_time, log_operator_time, record_rpm
 from pyhugegraph.client import PyHugeClient
 
@@ -96,6 +97,10 @@ class KgBuilder:
         self.operators.append(PrintResult())
         return self
 
+    def build_schema(self):
+        self.operators.append(SchemaBuilder(self.llm))
+        return self
+
     @log_time("total time")
     @record_rpm
     def run(self, context: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
diff --git a/hugegraph-llm/src/hugegraph_llm/operators/llm_op/schema_build.py b/hugegraph-llm/src/hugegraph_llm/operators/llm_op/schema_build.py
new file mode 100644
index 0000000..5358738
--- /dev/null
+++ b/hugegraph-llm/src/hugegraph_llm/operators/llm_op/schema_build.py
@@ -0,0 +1,137 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import json
+import re
+from typing import List, Dict, Any, Optional
+
+from hugegraph_llm.models.llms.base import BaseLLM
+from hugegraph_llm.models.llms.init_llm import LLMs
+from hugegraph_llm.utils.log import log
+
+
+class SchemaBuilder:
+    """Automated Schema Generator"""
+
+    def __init__(
+        self,
+        llm: Optional[BaseLLM] = None,
+        schema_prompt: Optional[str] = None,
+    ):
+        self.llm = llm or LLMs().get_chat_llm()
+        # TODO: use a basic format for it
+        self.schema_prompt = schema_prompt or """
+            You are a Graph Schema Generator for Apache HugeGraph.
+            Based on the following three parts of content, output a Schema JSON that complies with HugeGraph specifications:
+
+            Inputs:
+            1. Few‐Shot Schema Examples (already formatted as valid HugeGraph schema JSON):
+            {few_shot_schema}
+
+            2. Query Examples (each with a question description):
+            {query_examples}
+
+            3. Raw Data Samples (plain text records to model as vertices/edges):
+            {raw_texts}
+
+            Constraints:
+            - Return only the JSON object
+            - Ensure the schema follows HugeGraph specifications
+            - Do not include comments or extra fields.
+        """
+
+    def _format_raw_texts(self, raw_texts: List[str]) -> str:
+        return "\n".join([f"- {text}" for text in raw_texts])
+
+    def _format_query_examples(self, query_examples: List[str]) -> str:
+        if not query_examples:
+            return "None"
+        examples = []
+        for example in query_examples:
+            examples.append(f"- {example}")
+        return "\n".join(examples)
+
+    def _format_few_shot_schema(self, few_shot_schema: Dict[str, Any]) -> str:
+        if not few_shot_schema:
+            return "None"
+        return json.dumps(few_shot_schema, indent=2, ensure_ascii=False)
+
+    def _extract_schema(self, response: str) -> Dict[str, Any]:
+        # Try to extract JSON from Markdown code block
+        match = re.search(r"```(?:json)?\s*(.*?)```", response, re.DOTALL)
+        if match:
+            response = match.group(1).strip()
+
+        try:
+            return json.loads(response)
+        except json.JSONDecodeError as e:
+            log.error("Failed to parse LLM response as JSON: %s", response)
+            raise RuntimeError("Invalid JSON response from LLM") from e
+
+    def build_prompt(
+        self,
+        raw_texts: List[str],
+        query_examples: List[Dict[str, str]],
+        few_shot_schema: Dict[str, Any]
+    ) -> str:
+        return self.schema_prompt.format(
+            raw_texts=self._format_raw_texts(raw_texts),
+            query_examples=self._format_query_examples(query_examples),
+            few_shot_schema=self._format_few_shot_schema(few_shot_schema)
+        )
+
+    def run(
+        self,
+        context: Dict[str, Any]
+    ) -> Dict[str, Any]:
+        """Generate schema from context containing raw_texts, query_examples and few_shot_schema.
+
+        Args:
+            context: Dictionary containing:
+                - raw_texts: List of raw text samples
+                - query_examples: List of query examples (description + Gremlin)
+                - few_shot_schema: Example schema for few-shot learning
+
+        Returns:
+            Generated schema as dictionary
+        """
+        if not isinstance(context, dict):
+            raise ValueError("Context must be a dictionary")
+        if "raw_texts" not in context or not isinstance(context["raw_texts"], list):
+            raise ValueError("'raw_texts' must be a list[str]")
+        if "query_examples" not in context or not isinstance(context["query_examples"], list):
+            raise ValueError("'query_examples' must be a list[str]")
+        if "few_shot_schema" not in context or not isinstance(context["few_shot_schema"], dict):
+            raise ValueError("'few_shot_schema' must be a dict")
+
+        raw_texts = context["raw_texts"]
+        query_examples = context["query_examples"]
+        few_shot_schema = context["few_shot_schema"]
+
+        prompt = self.build_prompt(raw_texts, query_examples, few_shot_schema)
+
+        try:
+            response = self.llm.generate(prompt=prompt)
+            if not response or not response.strip():
+                raise RuntimeError("LLM returned empty response")
+        except Exception as e:
+            log.error("LLM generation failed: %s", str(e))
+            raise RuntimeError(f"Failed to generate schema: {str(e)}") from e
+
+        schema = self._extract_schema(response)
+        log.debug("Generated schema: %s", json.dumps(schema, indent=2))
+        return schema
diff --git a/hugegraph-llm/src/hugegraph_llm/resources/prompt_examples/query_examples.json b/hugegraph-llm/src/hugegraph_llm/resources/prompt_examples/query_examples.json
new file mode 100644
index 0000000..1019124
--- /dev/null
+++ b/hugegraph-llm/src/hugegraph_llm/resources/prompt_examples/query_examples.json
@@ -0,0 +1,9 @@
+[
+  "Property filter: Find all 'person' nodes with age > 30 and return their name and occupation",
+  "Relationship traversal: Find all roommates of the person named Alice, and return their name and age",
+  "Shortest path: Find the shortest path between Bob and Charlie and show the edge labels along the way",
+  "Subgraph match: Find all friend pairs who both follow the same webpage, and return the names and URL",
+  "Aggregation: Count the number of people for each occupation and compute their average age",
+  "Time-based filter: Find all nodes created after 2025-01-01 and return their name and created_at",
+  "Top-N query: List top 10 most visited webpages with their URL and visit_count"
+]
diff --git a/hugegraph-llm/src/hugegraph_llm/resources/prompt_examples/schema_examples.json b/hugegraph-llm/src/hugegraph_llm/resources/prompt_examples/schema_examples.json
new file mode 100644
index 0000000..720c70d
--- /dev/null
+++ b/hugegraph-llm/src/hugegraph_llm/resources/prompt_examples/schema_examples.json
@@ -0,0 +1,47 @@
+{
+"vertexlabels": [
+    {
+    "id": 1,
+    "name": "person",
+    "id_strategy": "PRIMARY_KEY",
+    "primary_keys": [
+        "name"
+    ],
+    "properties": [
+        "name",
+        "age",
+        "occupation"
+    ]
+    },
+    {
+    "id": 2,
+    "name": "webpage",
+    "id_strategy": "PRIMARY_KEY",
+    "primary_keys": [
+        "name"
+    ],
+    "properties": [
+        "name",
+        "url"
+    ]
+    }
+],
+"edgelabels": [
+    {
+    "id": 1,
+    "name": "roommate",
+    "source_label": "person",
+    "target_label": "person",
+    "properties": [
+        "date"
+    ]
+    },
+    {
+    "id": 2,
+    "name": "link",
+    "source_label": "webpage",
+    "target_label": "person",
+    "properties": []
+    }
+]
+}
diff --git a/hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py b/hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py
index bb9e3c8..62a8219 100644
--- a/hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py
+++ b/hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py
@@ -137,3 +137,44 @@ def import_graph_data(data: str, schema: str) -> Union[str, Dict[str, Any]]:
         # Note: can't use gr.Error here
         gr.Warning(str(e) + " Please check the graph data format/type carefully.")
         return data
+
+def build_schema(input_text, query_example, few_shot):
+    context = {
+        "raw_texts": [input_text] if input_text else [],
+        "query_examples": [],
+        "few_shot_schema": {}
+    }
+
+    if few_shot:
+        try:
+            context["few_shot_schema"] = json.loads(few_shot)
+        except json.JSONDecodeError as e:
+            raise gr.Error(f"Few Shot Schema is not in a valid JSON format: {e}") from e
+
+    if query_example:
+        try:
+            parsed_examples = json.loads(query_example)
+            # Validate and retain the description and gremlin fields
+            context["query_examples"] = [
+                {
+                    "description": ex.get("description", ""),
+                    "gremlin": ex.get("gremlin", "")
+                }
+                for ex in parsed_examples
+                if isinstance(ex, dict) and "description" in ex and "gremlin" in ex
+            ]
+        except json.JSONDecodeError as e:
+            raise gr.Error(f"Query Examples is not in a valid JSON format: {e}") from e
+
+    builder = KgBuilder(LLMs().get_chat_llm(), Embeddings().get_embedding(), get_hg_client())
+    try:
+        schema = builder.build_schema().run(context)
+    except Exception as e:
+        log.error("Failed to generate schema: %s", e)
+        raise gr.Error(f"Schema generation failed: {e}") from e
+    try:
+        formatted_schema = json.dumps(schema, ensure_ascii=False, indent=2)
+        return formatted_schema
+    except (TypeError, ValueError) as e:
+        log.error("Failed to format schema: %s", e)
+        return str(schema)
