Re: Document chunking

2024-04-09 Thread Tim Allison
My 0.02... 1) It is important that we do what we can to make it easy for people to integrate Tika into the dense vector/LLM/RAG landscape. I see A LOT of projects reinventing the wheel (without multi-parser full recursion like we have), or just running pdftotext and declaring victory. So, if we

Re: Document chunking

2024-04-09 Thread Eric Pugh
Your approach sounds great as well, Nick.
> On Apr 9, 2024, at 2:21 AM, Michael Wechner wrote:
>
> Thanks for sharing your approach!
>
> Do you already have some code to share?
>
> Today I read about https://github.com/infiniflow/ragflow which might also
> have some interesting chunking

Re: Document chunking

2024-04-09 Thread Michael Wechner
Thanks for sharing your approach! Do you already have some code to share? Today I read about https://github.com/infiniflow/ragflow which might also have some interesting chunking approaches. Thanks Michael On 09.04.24 at 01:25, Nick Burch wrote: On Mon, 8 Apr 2024, Tim Allison wrote: Not

Re: Document chunking

2024-04-08 Thread Nick Burch
On Mon, 8 Apr 2024, Tim Allison wrote: Not sure we should jump on the bandwagon, but anything we can do to support smart chunking would benefit us. Could just be more integrations with parsers that turn out to be useful. I haven’t had much joy with some. Here’s one that I haven’t evaluated

Re: Document chunking

2024-04-08 Thread Nicholas DiPiazza
I am also very interested in this vector-based search. Indexes are a big thing right now.
On Mon, Apr 8, 2024, 4:16 PM Michael Wechner wrote:
> It would be great to have good "semantic chunking" in order to generate
> vector embeddings.
>
> Thanks for the link below, will try to test it.

Re: Document chunking

2024-04-08 Thread Michael Wechner
It would be great to have good "semantic chunking" in order to generate vector embeddings. Thanks for the link below, will try to test it. Thanks Michael On 08.04.24 at 18:29, Tim Allison wrote: Not sure we should jump on the bandwagon, but anything we can do to support smart chunking

Document chunking

2024-04-08 Thread Tim Allison
Not sure we should jump on the bandwagon, but anything we can do to support smart chunking would benefit us. Could just be more integrations with parsers that turn out to be useful. I haven’t had much joy with some. Here’s one that I haven’t evaluated yet: https://github.com/Filimoa/open-parse