Hi Greg,

Have you tried experimenting with the ADDCACHE [1] option when building your database? While it's been a bit, I recall having good results with it, especially in a RAM-constrained environment.
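In the standalone console, that looks something like this — just a sketch from memory, with the database name and path as placeholders:

  SET ADDCACHE true
  CREATE DB books /path/to/xml

As I recall, with ADDCACHE enabled, CREATE DB and ADD first cache incoming documents to disk rather than holding everything in main memory while the database is built.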
Hope that's helpful!

Best,
Bridger

[1] https://docs.basex.org/wiki/Options#ADDCACHE

On Thu, Mar 14, 2024 at 9:55 PM Murray, Gregory <gregory.mur...@ptsem.edu> wrote:

> Thanks, Christian. I don’t think selective indexing is applicable in my use case, because I need to perform full-text searches on the entirety of each document. Each XML document represents a physical book that was digitized, and the structure of each document is essentially a header with metadata and a body with the OCR text of the book. The OCR text is split into pages, where one <page> element contains all the words from one corresponding printed page of the physical book. Obviously the number of words in each <page> varies widely based on the physical dimensions of the book and the typeface.
>
> So far, I have loaded 12,331 documents, containing a total of 2,196,771 pages. The total size of those XML documents on disk is 4.7 GB. But that is only a fraction of the total number of documents I want to load into BaseX. The total number is more like 160,000 documents. Assuming that the documents I’ve loaded so far are a representative sample, and I believe that’s true, then the total size of the XML documents on disk, prior to loading them into BaseX, would be about 4.7 GB * 13 = 61.1 GB.
>
> Normally the OCR text, once loaded, almost never changes. But the metadata fields do change as corrections are made. Also, we add more XML documents routinely as we digitize more books over time. Therefore updates and additions are commonplace, such that keeping indexes up to date is important to allow full-text searches to stay performant. I’m wondering if there are techniques for optimizing such quantities of text.
>
> Thanks,
> Greg
>
> From: Christian Grün <christian.gr...@gmail.com>
> Date: Thursday, March 14, 2024 at 8:48 AM
> To: Murray, Gregory <gregory.mur...@ptsem.edu>
> Cc: basex-talk@mailman.uni-konstanz.de <basex-talk@mailman.uni-konstanz.de>
> Subject: Re: [basex-talk] Out of Main Memory
>
> Hi Greg,
>
> A quick reply: If only parts of your documents are relevant for full-text queries, you can restrict the selection with the FTINDEX option (see [1] for more information).
>
> How large is the total size of your input documents?
>
> Best,
> Christian
>
> [1] https://docs.basex.org/wiki/Indexes#Selective_Indexing
>
> On Tue, Mar 12, 2024 at 8:34 PM Murray, Gregory <gregory.mur...@ptsem.edu> wrote:
>
> > Hello,
> >
> > I’m working with a database that has a full-text index. I have found that if I iteratively add XML documents, then optimize, add more documents, optimize again, and so on, eventually the “optimize” command will fail with “Out of Main Memory.” I edited the basex startup script to change the memory allocation from -Xmx2g to -Xmx12g. My computer has 16 GB of memory, but of course the OS uses up some of it. I have found that if I exit memory-hungry programs (web browser, Oxygen), start basex, and then run the “optimize” command, I still get “Out of Main Memory.” I’m wondering if there are any known workarounds or strategies for this situation. If I understand the documentation about indexes correctly, index data is periodically written to disk during optimization. Does this mean that running optimize again will pick up where the previous attempt left off, such that running optimize repeatedly will eventually succeed?
> >
> > Thanks,
> > Greg
> >
> > Gregory Murray
> > Director of Digital Initiatives
> > Wright Library
> > Princeton Theological Seminary
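P.S. On the OPTIMIZE failures themselves: the SPLITSIZE option [2] might also be worth experimenting with. If I remember right, it controls how much index data is built up in main memory before a partial index is written to disk during index construction, so forcing smaller splits should trade some speed for a lower memory footprint. Roughly like this — again from memory, where the database name is a placeholder and the value is purely illustrative, so check the docs for sensible settings:

  SET SPLITSIZE 100
  OPEN books
  OPTIMIZE

[2] https://docs.basex.org/wiki/Options#SPLITSIZE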