Hi Greg,

Assuming you have multiple cores available, you can also execute a search in parallel using the (BaseX-specific) xquery:fork-join function [1]. That’s what I usually do when searching across databases.
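For example, a parallel version of the multi-database loop might look roughly like this (just a sketch; the database naming scheme and the search term are placeholders):

xquery:fork-join(
  for $i in 1 to 100
  return function() {
    (: search one database per thread :)
    db:get('books' || $i)//page[. contains text 'grace']
  }
)

Each function is evaluated in a separate thread, and the results are returned in the order of the supplied functions.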
All best,
Tim

[1] https://docs.basex.org/wiki/XQuery_Module#xquery:fork-join

--
Tim A. Thompson (he, him)
Librarian for Applied Metadata Research
Yale University Library
www.linkedin.com/in/timathompson

From: BaseX-Talk <basex-talk-boun...@mailman.uni-konstanz.de> on behalf of Murray, Gregory <gregory.mur...@ptsem.edu>
Date: Friday, March 15, 2024 at 12:12 PM
To: Christian Grün <christian.gr...@gmail.com>
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Out of Main Memory

Thanks, Christian. Distributing documents across many databases sounds fine, as long as XPath expressions and full-text searching remain reasonably efficient. In the documentation, the example of addressing multiple databases uses a loop:

for $i in 1 to 100
return db:get('books' || $i)//book/title

Is that the preferred technique? Also, is it possible to perform searches in the same manner without interfering with relevance scores?

Thanks,
Greg

From: Christian Grün <christian.gr...@gmail.com>
Date: Friday, March 15, 2024 at 11:51 AM
To: Murray, Gregory <gregory.mur...@ptsem.edu>
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Out of Main Memory

Hi Greg,

I would have guessed that 12 GB is enough for 4.7 GB of input, but it sometimes depends on the documents. If you like, you can share a single typical document with us, and we can have a look at it.

61 GB will be too large for a complete full-text index, though. However, it’s always possible to distribute documents across multiple databases and access them with a single query [1].

The full-text index is not incremental (unlike the other index structures), which means it must be re-created after updates. However, it’s possible to re-index updated database instances and query fully indexed databases at the same time.

Hope this helps,
Christian

[1] https://docs.basex.org/wiki/Databases

On Thu, Mar 14, 2024 at 10:58 PM Murray, Gregory <gregory.mur...@ptsem.edu> wrote:

Thanks, Christian. I don’t think selective indexing is applicable in my use case, because I need to perform full-text searches on the entirety of each document. Each XML document represents a digitized physical book: the structure of each document is essentially a header with metadata and a body with the OCR text of the book. The OCR text is split into pages, where one <page> element contains all the words from the corresponding printed page of the physical book. Naturally, the number of words in each <page> varies widely with the physical dimensions of the book and the typeface.

So far, I have loaded 12,331 documents, containing a total of 2,196,771 pages. The total size of those XML documents on disk is 4.7 GB, but that is only a fraction of the number of documents I want to load into BaseX; the total is more like 160,000. Assuming the documents loaded so far are a representative sample (and I believe they are), the total size of the XML documents on disk, prior to loading them into BaseX, would be about 4.7 GB * 13 = 61.1 GB.

Normally the OCR text, once loaded, almost never changes, but the metadata fields do change as corrections are made. We also routinely add more XML documents as we digitize more books over time. Updates and additions are therefore commonplace, so keeping the indexes up to date is important for full-text searches to stay performant.
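As Christian notes above, the full-text index is not updated incrementally, so a batch of metadata corrections would be followed by a re-index. A rough sketch (the database name is a placeholder):

(: rebuild all index structures, including the full-text index :)
db:optimize('books1', true(), map { 'ftindex': true() })

With the second argument set to true(), the database is fully optimized and all index structures are re-created.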
I’m wondering if there are techniques for optimizing such quantities of text.

Thanks,
Greg

From: Christian Grün <christian.gr...@gmail.com>
Date: Thursday, March 14, 2024 at 8:48 AM
To: Murray, Gregory <gregory.mur...@ptsem.edu>
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Out of Main Memory

Hi Greg,

A quick reply: if only parts of your documents are relevant for full-text queries, you can restrict the selection with the FTINCLUDE option (see [1] for more information). How large is the total size of your input documents?

Best,
Christian

[1] https://docs.basex.org/wiki/Indexes#Selective_Indexing

On Tue, Mar 12, 2024 at 8:34 PM Murray, Gregory <gregory.mur...@ptsem.edu> wrote:

Hello,

I’m working with a database that has a full-text index. I have found that if I iteratively add XML documents, then optimize, add more documents, optimize again, and so on, eventually the “optimize” command will fail with “Out of Main Memory.” I edited the basex startup script to change the memory allocation from -Xmx2g to -Xmx12g. My computer has 16 GB of memory, but of course the OS uses some of it. Even when I exit memory-hungry programs (web browser, Oxygen), start basex, and then run the “optimize” command, I still get “Out of Main Memory.”

I’m wondering if there are any known workarounds or strategies for this situation. If I understand the documentation about indexes correctly, index data is periodically written to disk during optimization. Does this mean that running optimize again will pick up where the previous attempt left off, such that running optimize repeatedly will eventually succeed?

Thanks,
Greg

Gregory Murray
Director of Digital Initiatives
Wright Library
Princeton Theological Seminary
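For reference, the selective indexing Christian suggests above could restrict the full-text index to the <page> elements that hold the OCR text. A sketch (the database name is a placeholder; the element name follows the document structure described in the thread):

(: rebuild indexes, limiting the full-text index to <page> elements :)
db:optimize('books', true(), map { 'ftindex': true(), 'ftinclude': 'page' })

This rebuilds all index structures but only full-text-indexes text inside <page> elements, so the metadata header stays out of the index.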