Re: Document chunking

Michael Wechner Tue, 09 Apr 2024 02:48:42 -0700

Thanks for sharing your approach!

Do you already have some code to share?

Today I read about https://github.com/infiniflow/ragflow which might also have some interesting chunking approaches.


Thanks

Michael

On 09.04.24 at 01:25, Nick Burch wrote:

On Mon, 8 Apr 2024, Tim Allison wrote:
Not sure we should jump on the bandwagon, but anything we can do to support smart chunking would benefit us.
Could just be more integrations with parsers that turn out to be useful. I
haven’t had much joy with some. Here’s one that I haven’t evaluated yet:
https://github.com/Filimoa/open-parse
I played around with chunking a bit late last year, but owing to not getting any of the AI jobs I went for, I didn't get it beyond a rough prototype. I can say that most people are doing a terrible job in their out-of-the-box configs...
My current suggested (but not fully tested) approach is:
 * Define a range of chunk sizes that you'd like (min / ideal / max)
 * Parse as XHTML with Tika
 * Keep track of headings and table headers
 * Break on headings
 * If a chunk is too big, break on other elements (eg div or p)
 * If a chunk is too small, and near other small chunks, join them
 * Include 1-2 headings above the current one at the top,
   as a targeted bit of Table of Contents. (eg chunk starts on H3, put
   the H2 in as well)
 * If you broke up a huge table, repeat the table headers at the
   start of every chunk
 * When you're done chunking + adding bits back at the top, convert
   to markdown on output
Happy to explain more! But sadly lacking time right now to do much on that.
Nick
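
For anyone who wants something concrete to poke at, here is a minimal Python sketch of the steps Nick lists above. It is not his prototype and is untested against real documents. Assumptions: the input is well-formed XHTML as Tika emits it, headings and paragraphs are direct children of <body>, sizes are counted in characters, and the names chunk_xhtml, min_size, max_size and context are made up for illustration. The "ideal" size and the table-header repetition are left out to keep it short.

```python
import re
import xml.etree.ElementTree as ET

HEADING = re.compile(r"h([1-6])$")


def local(tag):
    """Strip any XML namespace, e.g. '{http://www.w3.org/1999/xhtml}p' -> 'p'."""
    return tag.rsplit("}", 1)[-1]


def text_of(el):
    """Collapse an element's text content to single-spaced plain text."""
    return " ".join("".join(el.itertext()).split())


def chunk_xhtml(xhtml, min_size=200, max_size=1000, context=1):
    """Hypothetical helper: split XHTML into markdown-ish chunks by heading."""
    root = ET.fromstring(xhtml)
    body = next((e for e in root.iter() if local(e.tag) == "body"), root)

    path = []    # stack of (level, text) for the headings currently in scope
    pieces = []  # each piece: (heading lines, list of block texts)
    heads, blocks = [], []

    def flush():
        # Emit the current section, breaking it on block boundaries if too big.
        part, size = [], 0
        for b in blocks:
            if part and size + len(b) > max_size:
                pieces.append((heads, part))
                part, size = [], 0
            part.append(b)
            size += len(b)
        if part:
            pieces.append((heads, part))

    for el in body:
        txt = text_of(el)
        if not txt:
            continue
        m = HEADING.match(local(el.tag))
        if m:
            flush()  # break on headings
            level = int(m.group(1))
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, txt))
            # Keep this heading plus `context` headings above it.
            heads = ["#" * lv + " " + t for lv, t in path[-(context + 1):]]
            blocks = []
        else:
            blocks.append(txt)
    flush()

    # Join neighbouring chunks when both are small and the result still fits.
    chunks = []
    for heads, part in pieces:
        lines = heads + part
        size = sum(len(x) for x in lines)
        if (chunks and size < min_size and chunks[-1][1] < min_size
                and chunks[-1][1] + size <= max_size):
            prev = chunks[-1][0]
            prev.extend(h for h in heads if h not in prev)  # skip repeated headings
            prev.extend(part)
            chunks[-1] = (prev, sum(len(x) for x in prev))
        else:
            chunks.append((list(lines), size))
    return ["\n\n".join(lines) for lines, _ in chunks]
```

Calling chunk_xhtml(xhtml_string, min_size=0) turns merging off, which makes it easier to see where the heading breaks fall. A real version would also need to walk nested div elements and handle tables.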
