* Ihor Radchenko <yanta...@posteo.net> [2025-05-04 12:58]:
> Jean Louis <bugs@gnu.support> writes:
> 
> >> Do you parse Org files for chunking? What is the chunking strategy?
> >
> > Yes, I parse by headings. That may not be as best.
> > ...
> >          (headings (rcd-org-get-headings-with-contents))
> > ...
> >              (input (concat heading-text "\n" contents))
> >              (embeddings (rcd-llm-get-embedding input nil "search_document: ")))
> > ...
> >                (contents (when (org-element-property :contents-begin hl)
> >                        (buffer-substring-no-properties
> >                             (org-element-property :contents-begin hl)
> >                             (org-element-property :contents-end hl)))))
> 
> So, it seems that you are including the whole subtree under heading and
> then split the text into fixed size chunks.

Something like that. Thanks for the observation; I had missed the message.

> AFAIU, that's not the best strategy, and you may cut the chunks abruptly
> in the middle of headings/sentence. You may consider something like
> https://python.langchain.com/docs/how_to/recursive_text_splitter/
> Since you can work with AST, it will be trivial to split things all the
> way down to paragraph level and then split the paragraphs by sentences
> (if that is necessary).
> 
> Using meaningful chunking tends to improve vector search and LLM
> performance _a lot_.

Thanks much for the observation; I find it a crucial detail for the
future of computing. Vectors will be in use for quite some time, until
something new and better gets discovered.
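
For illustration, here is a minimal sketch of that recursive idea:
split on the coarsest separator first and recurse with finer separators
(down to sentences and words) only for pieces that are still too long.
The function name, separator list and size limit below are just a rough
example of mine, not the code I actually run:

def recursive_split(text, max_chars=2000,
                    separators=("\n\n", "\n", ". ", " ")):
    """Illustrative recursive splitter: coarse separators first, finer
    ones only for pieces that still exceed max_chars."""
    if len(text) <= max_chars or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_chars:
            current = candidate          # piece still fits into current chunk
        elif len(piece) <= max_chars:
            if current:
                chunks.append(current)
            current = piece              # start a new chunk with this piece
        else:
            if current:
                chunks.append(current)
                current = ""
            # This piece alone is too long: recurse with the finer separators.
            chunks.extend(recursive_split(piece, max_chars, rest))
    if current:
        chunks.append(current)
    return chunks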

Here is my rcd-semantic-split-server.py, which has been running in
memory for months now; so far I have seen no practical issues with it.
Surely it is not perfect.

There is some overlap between chunks, as you can see below: with the
defaults max_tokens=512 and overlap=50, consecutive chunks start 462
tokens apart, so each chunk shares its last 50 tokens with the next.

Splitting down to the level of the sentence would mean that I also
want to link at the level of the sentence, and we do not yet have
functions in Emacs that do that well.

I am using it in RCD Notes & Hyperscope, the Dynamic Knowledge
Repository for GNU Emacs, which operates on a meta level. The system
works well; it generates semantic links automatically every day and
thus improves overall sales. To me, everything must be practical.

from fastapi import Body, FastAPI, UploadFile, File, HTTPException
from fastapi.responses import JSONResponse
import tiktoken
import re
from typing import List, Dict

app = FastAPI()

# Constants
MAX_INPUT_LENGTH = 1000000  # ~1MB of text
BATCH_SIZE = 100000  # Larger batch size for better performance (currently unused in this file)

# Pre-compile regex patterns for better performance
REPEAT_CHARS = re.compile(r'(.)\1{2,}')  # For chars like ---, ===
BOX_CHARS = re.compile(r'[─━│┃┄┅┆┇┈┉┊┋╌╍╎╏═║╒╓╔╕╖╗╘╙╚╛╜╝╞╟╠╡╢╣╤╥╦╧╨╩╪╫╬]+')

def clean_text(text: str) -> str:
    """Clean text without any HTML parsing"""
    # Reduce repetitive characters (3+ repeats down to 3)
    text = REPEAT_CHARS.sub(r'\1\1\1', text)
    
    # Replace box-drawing characters with simple dashes
    text = BOX_CHARS.sub('---', text)
    
    # Normalize whitespace
    return ' '.join(text.split())

def chunk_text(text: str, max_tokens: int = 512, overlap: int = 50) -> List[Dict]:
    """Efficient chunking with token awareness"""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    
    # Slide a fixed-size token window with stride (max_tokens - overlap),
    # so consecutive chunks share `overlap` tokens of context.
    for i in range(0, len(tokens), max_tokens - overlap):
        chunk_tokens = tokens[i:i + max_tokens]
        chunks.append({
            "text": enc.decode(chunk_tokens),
            "tokens": len(chunk_tokens),
            "start_token": i,
            "end_token": i + len(chunk_tokens)
        })
    
    return chunks

@app.post("/chunk")
async def chunk_file(
    file: UploadFile = File(...),
    max_tokens: int = 512,
    overlap: int = 50
):
    if not file.content_type.startswith('text/'):
        raise HTTPException(400, "Only text files accepted")
    
    try:
        text = (await file.read()).decode('utf-8')
        if len(text) > MAX_INPUT_LENGTH:
            raise HTTPException(413, f"Input too large. Max {MAX_INPUT_LENGTH} chars allowed")
        
        cleaned_text = clean_text(text)
        chunks = chunk_text(cleaned_text, max_tokens, overlap)
        return JSONResponse({
            "filename": file.filename,
            "total_chunks": len(chunks),
            "chunks": chunks
        })
    except HTTPException:
        raise  # keep the original status code (e.g. 413) instead of masking it as 500
    except Exception as e:
        raise HTTPException(500, f"Processing error: {str(e)}")

@app.post("/chunk_text")
async def chunk_raw_text(
    text: str = Body(..., embed=True),
    max_tokens: int = Body(512),
    overlap: int = Body(50)
):
    try:
        if len(text) > MAX_INPUT_LENGTH:
            raise HTTPException(413, f"Input too large. Max {MAX_INPUT_LENGTH} chars allowed")
        
        cleaned_text = clean_text(text)
        chunks = chunk_text(cleaned_text, max_tokens, overlap)
        return JSONResponse({
            "total_chunks": len(chunks),
            "chunks": chunks
        })
    except HTTPException:
        raise  # preserve the 413 status instead of converting it to 500
    except Exception as e:
        raise HTTPException(500, f"Error: {str(e)}")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8201)
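
For a quick test, the /chunk_text endpoint can be called for example
like this (assuming the server runs on localhost:8201 as configured
above; the sample text is arbitrary):

import requests

response = requests.post(
    "http://localhost:8201/chunk_text",
    json={"text": "First paragraph.\n\nSecond paragraph.",
          "max_tokens": 512, "overlap": 50},
)
for chunk in response.json()["chunks"]:
    print(chunk["start_token"], chunk["end_token"], chunk["text"][:60])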

-- 
Jean Louis
