Hi Greg,

Have you tried experimenting with the ADDCACHE[1] option when building your
database? It's been a while, but I recall having good results with it,
especially in a RAM-constrained environment.
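
For example, something along these lines in a BaseX session (a rough sketch;
the database name and input path are placeholders):

  SET ADDCACHE true
  CREATE DB books /path/to/xml

If I remember right, ADDCACHE writes incoming documents to a temporary disk
cache before adding them, which keeps memory usage lower during large imports.
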
Hope that's helpful!
Best,
Bridger

[1] https://docs.basex.org/wiki/Options#ADDCACHE

On Thu, Mar 14, 2024 at 9:55 PM Murray, Gregory <gregory.mur...@ptsem.edu>
wrote:

> Thanks, Christian. I don’t think selective indexing is applicable in my
> use case, because I need to perform full-text searches on the entirety of
> each document. Each XML document represents a physical book that was
> digitized, and the structure of each document is essentially a header with
> metadata and a body with the OCR text of the book. The OCR text is split
> into pages, where one <page> element contains all the words from one
> corresponding printed page from the physical book. Obviously the number of
> words in each <page> varies widely based on the physical dimensions of the
> book and the typeface.
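>
> In simplified form, a document looks something like this (element names other
> than <page> are illustrative, not the actual tag names):
>
>   <book>
>     <header> ... descriptive metadata ... </header>
>     <body>
>       <page>OCR text of printed page 1 ...</page>
>       <page>OCR text of printed page 2 ...</page>
>     </body>
>   </book>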
>
>
>
> So far, I have loaded 12,331 documents, containing a total of 2,196,771
> pages. The total size of those XML documents on disk is 4.7GB. But that is
> only a fraction of the total number of documents I want to load into BaseX.
> The total number is more like 160,000 documents. Assuming the documents I've
> loaded so far are a representative sample, which I believe they are, the
> total size of the XML documents on disk, prior to loading them into BaseX,
> would be about 4.7GB * 13 = 61.1GB.
>
>
>
> Normally the OCR text, once loaded, almost never changes, but the metadata
> fields do change as corrections are made, and we routinely add more XML
> documents as we digitize more books over time. Updates and additions are
> therefore commonplace, so keeping the indexes up to date is important for
> full-text searches to stay performant. I'm wondering if there are techniques
> for optimizing such quantities of text.
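>
> For context, my current load-and-index cycle is essentially the following,
> repeated in batches (database name and paths are placeholders; exact commands
> may vary by BaseX version):
>
>   OPEN books
>   ADD TO batch-0042/ /data/ocr/batch-0042/
>   OPTIMIZE
>
> It's the OPTIMIZE step that eventually fails with the out-of-memory error.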
>
>
>
> Thanks,
>
> Greg
>
>
>
> *From: *Christian Grün <christian.gr...@gmail.com>
> *Date: *Thursday, March 14, 2024 at 8:48 AM
> *To: *Murray, Gregory <gregory.mur...@ptsem.edu>
> *Cc: *basex-talk@mailman.uni-konstanz.de <
> basex-talk@mailman.uni-konstanz.de>
> *Subject: *Re: [basex-talk] Out of Main Memory
>
> Hi Greg,
>
>
>
> A quick reply: If only parts of your documents are relevant for full-text
> queries, you can restrict the selection with the FTINDEX option (see [1]
> for more information).
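>
> For example, something along these lines when creating the database (a rough
> sketch; the element name is a placeholder, and [1] lists the exact option
> names for selective indexing):
>
>   SET FTINDEX true
>   SET FTINCLUDE body
>   CREATE DB mydb /path/to/xml
>
> Only text inside the listed elements would then be part of the full-text
> index.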
>
>
>
> How large is the total size of your input documents?
>
>
>
> Best,
>
> Christian
>
>
>
> [1] https://docs.basex.org/wiki/Indexes#Selective_Indexing
>
>
>
> On Tue, Mar 12, 2024 at 8:34 PM Murray, Gregory <gregory.mur...@ptsem.edu>
> wrote:
>
> Hello,
>
>
>
> I’m working with a database that has a full-text index. I have found that
> if I iteratively add XML documents, then optimize, add more documents,
> optimize again, and so on, eventually the “optimize” command will fail with
> “Out of Main Memory.” I edited the basex startup script to change the
> memory allocation from -Xmx2g to -Xmx12g. My computer has 16 GB of memory,
> but of course the OS uses up some of it. I have found that if I exit
> memory-hungry programs (web browser, Oxygen), start basex, and then run the
> “optimize” command, I still get “Out of Main Memory.” I’m wondering if
> there are any known workarounds or strategies for this situation. If I
> understand the documentation about indexes correctly, index data is
> periodically written to disk during optimization. Does this mean that
> running optimize again will pick up where the previous attempt left off,
> such that running optimize repeatedly will eventually succeed?
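>
> For reference, my edit to the basex startup script simply raises the -Xmx
> value in the java invocation, roughly like this (the exact line varies by
> BaseX version and installation):
>
>   java -Xmx12g -cp "$CP" org.basex.BaseX "$@"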
>
>
>
> Thanks,
>
> Greg
>
>
>
>
>
> Gregory Murray
>
> Director of Digital Initiatives
>
> Wright Library
>
> Princeton Theological Seminary
>
