* Ihor Radchenko <yanta...@posteo.net> [2025-05-04 12:58]:
> Jean Louis <bugs@gnu.support> writes:
>
> >> Do you parse Org files for chunking? What is the chunking strategy?
> >
> > Yes, I parse by headings. That may not be as best.
> > ...
> > (headings (rcd-org-get-headings-with-contents))
> > ...
> > (input (concat heading-text "\n" contents))
> > (embeddings (rcd-llm-get-embedding input nil "search_document: ")))
> > ...
> > (contents (when (org-element-property :contents-begin hl)
> >             (buffer-substring-no-properties
> >              (org-element-property :contents-begin hl)
> >              (org-element-property :contents-end hl)))))
>
> So, it seems that you are including the whole subtree under heading and
> then split the text into fixed size chunks.
Something like that, thanks for the observation; I missed the message.

> AFAIU, that's not the best strategy, and you may cut the chunks abruptly
> in the middle of headings/sentence. You may consider something like
> https://python.langchain.com/docs/how_to/recursive_text_splitter/
> Since you can work with AST, it will be trivial to split things all the
> way down to paragraph level and then split the paragraphs by sentences
> (if that is necessary).
>
> Using meaningful chunking tends to improve vector search and LLM
> performance _a lot_.

Thanks very much for the observation; I find it a crucial detail for the
future of computing. Vectors will be in use for quite some time, until
something new and better gets discovered.

Here is my rcd-semantic-split-server.py, which has been running in
memory for months, and so far I have seen no visible practical issues.
Surely it is not perfect. There are some overlaps, as you can see below.

Splitting down to the level of the sentence would mean that I wish to
link at the level of the sentence, and we do not yet have such precise
functions in Emacs.

I am using it in RCD Notes & Hyperscope, the Dynamic Knowledge
Repository for GNU Emacs, which operates on a meta level. The system
works well; daily it generates semantic links automatically and thus
improves overall sales. To me, everything must be practical.

from fastapi import Body, FastAPI, UploadFile, File, HTTPException
from fastapi.responses import JSONResponse
import tiktoken
import re
from typing import List, Dict

app = FastAPI()

# Constants
MAX_INPUT_LENGTH = 1000000  # ~1MB of text
BATCH_SIZE = 100000  # Increased batch size for better performance (not used below)

# Pre-compile regex patterns for better performance
REPEAT_CHARS = re.compile(r'(.)\1{2,}')  # For chars like ---, ===
BOX_CHARS = re.compile(r'[─━│┃┄┅┆┇┈┉┊┋╌╍╎╏═║╒╓╔╕╖╗╘╙╚╛╜╝╞╟╠╡╢╣╤╥╦╧╨╩╪╫╬]+')


def clean_text(text: str) -> str:
    """Clean text without any HTML parsing"""
    # Reduce repetitive characters (3+ repeats down to 3)
    text = REPEAT_CHARS.sub(r'\1\1\1', text)
    # Replace box-drawing characters with simple dashes
    text = BOX_CHARS.sub('---', text)
    # Normalize whitespace
    return ' '.join(text.split())


def chunk_text(text: str, max_tokens: int = 512, overlap: int = 50) -> List[Dict]:
    """Efficient chunking with token awareness"""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    # Slide a window of max_tokens over the token stream, stepping by
    # max_tokens - overlap so consecutive chunks share `overlap` tokens.
    for i in range(0, len(tokens), max_tokens - overlap):
        chunk_tokens = tokens[i:i + max_tokens]
        chunks.append({
            "text": enc.decode(chunk_tokens),
            "tokens": len(chunk_tokens),
            "start_token": i,
            "end_token": i + len(chunk_tokens)
        })
    return chunks


@app.post("/chunk")
async def chunk_file(
    file: UploadFile = File(...),
    max_tokens: int = 512,
    overlap: int = 50
):
    if not file.content_type.startswith('text/'):
        raise HTTPException(400, "Only text files accepted")
    try:
        text = (await file.read()).decode('utf-8')
        if len(text) > MAX_INPUT_LENGTH:
            raise HTTPException(413, f"Input too large. Max {MAX_INPUT_LENGTH} chars allowed")
        cleaned_text = clean_text(text)
        chunks = chunk_text(cleaned_text, max_tokens, overlap)
        return JSONResponse({
            "filename": file.filename,
            "total_chunks": len(chunks),
            "chunks": chunks
        })
    except HTTPException:
        # Re-raise HTTP errors (such as the 413 above) unchanged instead of
        # converting them into a generic 500 below.
        raise
    except Exception as e:
        raise HTTPException(500, f"Processing error: {str(e)}")


@app.post("/chunk_text")
async def chunk_raw_text(
    text: str = Body(..., embed=True),
    max_tokens: int = Body(512),
    overlap: int = Body(50)
):
    try:
        if len(text) > MAX_INPUT_LENGTH:
            raise HTTPException(413, f"Input too large. Max {MAX_INPUT_LENGTH} chars allowed")
        cleaned_text = clean_text(text)
        chunks = chunk_text(cleaned_text, max_tokens, overlap)
        return JSONResponse({
            "total_chunks": len(chunks),
            "chunks": chunks
        })
    except HTTPException:
        raise
    except Exception as e:
        raise HTTPException(500, f"Error: {str(e)}")


if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8201)
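
For testing, a minimal client sketch could look like the following. This
is only an illustration, not part of the server: it assumes the server
above is listening on localhost port 8201 (as in the uvicorn.run call)
and that the requests library is installed.

import requests

# Assumption: rcd-semantic-split-server.py is running locally on port 8201.
response = requests.post(
    "http://localhost:8201/chunk_text",
    json={"text": "First paragraph.\n\nSecond paragraph.",
          "max_tokens": 512,
          "overlap": 50},
    timeout=30,
)
response.raise_for_status()
for chunk in response.json()["chunks"]:
    print(chunk["start_token"], chunk["end_token"], chunk["text"])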
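
Regarding the paragraph-level splitting you suggest, a possible
refinement could look like the sketch below. It is purely illustrative
and not what the server above does: it splits on blank lines first, and
only paragraphs longer than the token limit fall back to fixed windows,
so chunk boundaries follow the structure of the text.

import re
from typing import Dict, List

import tiktoken

def chunk_by_paragraphs(text: str, max_tokens: int = 512) -> List[Dict]:
    """Illustrative paragraph-first chunking; not used by the server above."""
    enc = tiktoken.get_encoding("cl100k_base")
    paragraphs = [p.strip() for p in re.split(r'\n\s*\n', text) if p.strip()]
    chunks = []
    for para in paragraphs:
        tokens = enc.encode(para)
        # Short paragraphs stay whole; long ones fall back to token windows.
        for i in range(0, len(tokens), max_tokens):
            piece = tokens[i:i + max_tokens]
            chunks.append({"text": enc.decode(piece), "tokens": len(piece)})
    return chunks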

--
Jean Louis