Thanks for sharing your approach!
Do you already have some code to share?
Today I read about https://github.com/infiniflow/ragflow which might
also have some interesting chunking approaches.
Thanks
Michael
On 09.04.24 at 01:25, Nick Burch wrote:
On Mon, 8 Apr 2024, Tim Allison wrote:
Not sure we should jump on the bandwagon, but anything we can do to
support smart chunking would benefit us.
Could just be more integrations with parsers that turn out to be
useful. I haven’t had much joy with some. Here’s one that I haven’t
evaluated yet:
https://github.com/Filimoa/open-parse
I played around with chunking a bit late last year, but owing to not
getting any of the AI jobs I went for, I didn't get it beyond a rough
prototype. I can say that most people are doing a terrible job in their
out-of-the-box configs...
My current suggested (but not fully tested) approach is:
* Define a range of chunk sizes that you'd like (min / ideal / max)
* Parse as XHTML with Tika
* Keep track of headings and table headers
* Break on headings
* If a chunk is too big, break on other elements (eg div or p)
* If a chunk is too small, and near other small chunks, join them
* Include 1-2 headings above the current one at the top,
as a targeted bit of Table of Contents (eg if the chunk starts on an
H3, put the H2 in as well)
* If you broke up a huge table, repeat the table headers at the
start of every chunk
* When you're done chunking + adding bits back at the top, convert
to markdown on output
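
The steps above could be sketched roughly as follows. This is only an illustration under several assumptions, not the prototype mentioned in this mail: it takes XHTML that Tika has already produced, looks only at h1-h6 and p elements (lists, divs and tables, including the repeated table headers, are left out for brevity), measures size in characters, and uses only a min and max size without an "ideal" target. All function names and defaults (chunk_xhtml, to_markdown, min_size, max_size, context) are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

HEADING = re.compile(r"h([1-6])$")


def blocks(xhtml):
    """Yield (level, text) in document order; level is 0 for <p>, 1-6 for headings.

    Accepts str or bytes (use bytes if the input starts with an XML
    declaration that names an encoding).
    """
    for el in ET.fromstring(xhtml).iter():
        tag = el.tag.rsplit("}", 1)[-1].lower()  # drop the XHTML namespace
        text = " ".join("".join(el.itertext()).split())
        if not text:
            continue
        match = HEADING.match(tag)
        if match:
            yield int(match.group(1)), text
        elif tag == "p":
            yield 0, text


def body(chunk):
    """The chunk's own content: its heading line (if any) plus its paragraphs."""
    return ([chunk["head"]] if chunk["head"] else []) + chunk["paras"]


def size(chunk):
    return sum(len(part) for part in body(chunk))


def chunk_xhtml(xhtml, min_size=200, max_size=1000):
    path, chunks = [], []  # path = headings currently "open" above us
    for level, text in blocks(xhtml):
        if level:
            # Break on every heading; remember its ancestors as context.
            path = [h for h in path if h[0] < level]
            chunks.append({"trail": list(path),
                           "head": "#" * level + " " + text,
                           "paras": []})
            path.append((level, text))
            continue
        last = chunks[-1] if chunks else None
        if last is None or (last["paras"] and size(last) + len(text) > max_size):
            # Text before any heading, or the chunk would get too big:
            # break on this paragraph and keep the full heading path as context.
            chunks.append({"trail": list(path), "head": None, "paras": []})
        chunks[-1]["paras"].append(text)

    # Join a too-small chunk with a neighbouring too-small chunk.
    merged = []
    for chunk in chunks:
        if merged and size(merged[-1]) < min_size and size(chunk) < min_size:
            merged[-1]["paras"].extend(body(chunk))
        else:
            merged.append(chunk)
    return merged


def to_markdown(chunk, context=2):
    """Render one chunk, with up to `context` parent headings on top."""
    trail = chunk["trail"][-context:] if context else []
    lines = ["#" * level + " " + text for level, text in trail]
    return "\n\n".join(lines + body(chunk))
```

A single paragraph longer than max_size is not split any further here, and the context headings are not counted towards the chunk size, so real limits would need some headroom.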
Happy to explain more! But sadly I'm lacking time right now to do much on
that.
Nick