On Mon, 8 Apr 2024, Tim Allison wrote:
Not sure we should jump on the bandwagon, but anything we can do to support smart chunking would benefit us.

Could just be more integrations with parsers that turn out to be useful. I
haven’t had much joy with some. Here’s one that I haven’t evaluated yet:
https://github.com/Filimoa/open-parse

I played around with chunking a bit late last year, but owing to not getting any of the AI jobs I went for, I didn't get it beyond a rough prototype. I can say that most people are doing a terrible job in their out-of-the-box configs...

My current suggested (but not fully tested) approach is:
 * Define a range of chunk sizes that you'd like (min / ideal / max)
 * Parse as XHTML with Tika
 * Keep track of headings and table headers
 * Break on headings
 * If a chunk is too big, break on other elements (e.g. div or p)
 * If a chunk is too small, and near other small chunks, join them
 * Include 1-2 headings above the current one at the top,
   as a targeted bit of Table of Contents (e.g. if a chunk starts on an
   H3, put the H2 in as well)
 * If you broke up a huge table, repeat the table headers at the
   start of every chunk
 * When you're done chunking + adding bits back at the top, convert
   to markdown on output
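
As a rough illustration of the steps above, here is a minimal Python sketch.
It is not the prototype mentioned in this mail, and everything in it is an
assumption made for the example: the function names (blocks_from_xhtml,
chunk, size), measuring chunk size in characters, using only min / max
(no "ideal" size), and treating only h1-h6 and p as blocks. Table handling
(repeating table headers) is left out, and a heading with no paragraph of
its own only shows up as context for later chunks. The input is assumed to
be well-formed XHTML such as Tika produces.

```python
import re
import xml.etree.ElementTree as ET

HEADING = re.compile(r"h([1-6])$")


def blocks_from_xhtml(xhtml):
    """Flatten XHTML into (tag, text) pairs for headings and paragraphs."""
    blocks = []
    for el in ET.fromstring(xhtml).iter():
        tag = el.tag.rsplit("}", 1)[-1]  # drop the XHTML namespace, if any
        if tag == "p" or HEADING.match(tag):
            text = " ".join("".join(el.itertext()).split())
            if text:
                blocks.append((tag, text))
    return blocks


def size(piece):
    """Chunk size in characters (hypothetical measure; tokens also work)."""
    return sum(map(len, piece["paras"]))


def chunk(blocks, min_len=200, max_len=1000):
    """Return a list of markdown chunks built from (tag, text) blocks."""
    trail = []    # currently open headings as (level, text)
    pieces = []   # each: {"context": [...], "heading": [...], "paras": [...]}
    fresh = True  # True when the next paragraph must start a new piece

    for tag, text in blocks:
        m = HEADING.match(tag)
        if m:  # break on headings, and remember where we are in the outline
            level = int(m.group(1))
            trail = [h for h in trail if h[0] < level] + [(level, text)]
            fresh = True
            continue
        # Too big: break on the paragraph boundary, repeating the heading.
        if fresh or size(pieces[-1]) + len(text) > max_len:
            heads = ["#" * lvl + " " + txt for lvl, txt in trail]
            pieces.append({
                "context": heads[:-1][-2:],  # 1-2 headings above this one
                "heading": heads[-1:],       # the current heading, if any
                "paras": [],
            })
            fresh = False
        pieces[-1]["paras"].append(text)

    # Too small and next to another small piece: join them.
    merged = []
    for piece in pieces:
        if merged and size(merged[-1]) < min_len and size(piece) < min_len:
            merged[-1]["paras"] += piece["heading"] + piece["paras"]
        else:
            merged.append(piece)

    # Convert to markdown only at the very end.
    return ["\n\n".join(p["context"] + p["heading"] + p["paras"])
            for p in merged]
```

Usage would look something like chunk(blocks_from_xhtml(xhtml_string),
min_len=200, max_len=1000). Joining only small neighbours keeps a merged
chunk under 2 * min_len, so picking max_len >= 2 * min_len avoids the merge
step creating oversized chunks.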

Happy to explain more! But sadly I'm lacking the time right now to do much on that.

Nick
