The stack trace indicates that you enabled the full-text index as well. For this index, you definitely need more memory than is available on your system.
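If full-text queries are not actually needed for this data, a minimal, untested sketch for dropping that index and disabling the option so it is not requested again would be (db_name is a placeholder):

  OPEN db_name
  SET FTINDEX false
  DROP INDEX fulltext

That should relieve much of the memory pressure when the remaining index structures are optimized.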
So I assume you didn't encounter trouble with the default index structures?

first name last name <randomcod...@gmail.com> wrote on Sat, Oct 5, 2019, 20:52:

> Yes, I did, with -Xmx3100m (that's the maximum amount of memory I can allocate on that system for BaseX) and I got OOM.
>
> On Sat, Oct 5, 2019 at 2:19 AM Christian Grün <christian.gr...@gmail.com> wrote:
>
>> About option 1: How much memory have you been able to assign to the Java VM?
>>
>> first name last name <randomcod...@gmail.com> wrote on Sat, Oct 5, 2019, 01:11:
>>
>>> I had another look at the script I wrote and realized that it's not working as it's supposed to.
>>> Apparently the order of operations should be this:
>>> - turn on all the types of indexes required
>>> - create the db
>>> - apply the parser settings and the filter settings
>>> - add all the files to the db
>>> - run "OPTIMIZE"
>>>
>>> If I don't do them in this order (specifically with "OPTIMIZE" at the end), the resulting db lacks all indexes.
>>>
>>> On Fri, Oct 4, 2019 at 11:32 PM first name last name <randomcod...@gmail.com> wrote:
>>>
>>>> Hi Christian,
>>>>
>>>> About option 4:
>>>> I agree with the options you laid out. I am currently diving deeper into option 4 in the list you wrote.
>>>> Regarding the partitioning strategy, I agree. I did manage, however, to partition the files to be imported into separate sets, with a constraint on max partition size (on disk) and max partition file count (the number of XML documents in each partition).
>>>> The tool called fpart [5] made this possible (I can imagine more sophisticated bin-packing methods, involving pre-computed node counts and other variables, could be achieved via glpk [6], but that might be too much work).
>>>> So, currently I am experimenting with a max partition size of 2.4GB and a max file count of 85k files, and fpart seems to have split the file list into 11 partitions of ~33k files each, with each partition weighing in at ~2.4GB.
>>>> So, I wrote a script for this; it's called sharded-import.sh and is attached here. I'm also noticing that the /dba/ BaseX web interface is no longer blocked while this script runs, as opposed to the previous import, where I ran
>>>> CREATE DB db_name /directory/
>>>> so I can now watch the progress or run queries before the big import finishes.
>>>> Maybe the downside is that it's more verbose and prints out a ton of lines like
>>>> > ADD /share/Public/archive/tech-sites/linuxquestions.org/threads/viewtopic_9_356613.html
>>>> Resource(s) added in 47.76 ms.
>>>> along the way, and maybe that's slower than before.
>>>>
>>>> About option 1:
>>>> Re: increasing memory, I am running these experiments on a low-memory, old network-attached storage device, a QNAP TS-451+ [7] [8], which I had to take apart with a screwdriver to add 2GB of RAM (it now has 4GB of memory), and I can't seem to find any additional memory sticks around the house to take it up to 8GB (which is also the maximum memory it supports). If I want to buy something like 2 x 4GB sticks of RAM, the frequency has to match what the NAS supports, and I'm having trouble finding the exact model; Corsair says it has memory sticks that would work, but I'd have to wait weeks for them to ship to Bucharest, which is where I live.
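As a rough sketch, the ordering described a few messages above would look like this as a BaseX command script (the database name, paths, filter, and parser settings are only illustrative; the input here is HTML, so a non-XML parser may be needed):

  SET TEXTINDEX true
  SET ATTRINDEX true
  CREATE DB db_name
  SET CREATEFILTER *.html
  SET PARSER html
  ADD /path/to/partition-01
  ADD /path/to/partition-02
  OPTIMIZE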
>>>> It seems like buying an Intel NUC that goes up to 64GB of memory would be a bit too expensive at $1639 [9], but people on reddit [10] were discussing, some years back, this Supermicro server [11], which is only $668 and would allow up to 64GB of memory.
>>>> Basically I would buy something cheap that I can jam-pack with a lot of RAM, but a hands-off approach would be best here, so something that comes pre-equipped with all the memory would be nice (it would spare me the trouble of having to buy the memory separately and make sure it matches the motherboard specs, etc.).
>>>>
>>>> About option 2:
>>>> In fact, that's a great idea. But it would require me to write something that figures out the XPath patterns where the actual content sits. I actually wanted to look for some algorithm that's designed to do that, and try to implement it, but I had no time.
>>>> It would have to detect the repetitive, bloated nodes and build XPaths for the rest of the nodes, where the actual content sits. I think this would be equivalent to computing the "web template" of a website, given all its pages.
>>>> It would definitely decrease the size of the content that would have to be indexed.
>>>> By the way, here I'm writing about a more general procedure, because it's not just this dataset that I want to import; I want to import heavy, large amounts of data :)
>>>>
>>>> These are my thoughts for now.
>>>>
>>>> [5] https://github.com/martymac/fpart
>>>> [6] https://www.gnu.org/software/glpk/
>>>> [7] https://www.amazon.com/dp/B015VNLGF8
>>>> [8] https://www.qnap.com/en/product/ts-451+
>>>> [9] https://www.amazon.com/Intel-NUC-NUC8I7HNK-Gaming-Mini/dp/B07WGWWSWT/
>>>> [10] https://www.reddit.com/r/sysadmin/comments/64x2sb/nuc_like_system_but_with_64gb_ram/
>>>> [11] https://www.amazon.com/Supermicro-SuperServer-E300-8D-Mini-1U-D-1518/dp/B01M0VTV3E
>>>>
>>>> On Thu, Oct 3, 2019 at 1:30 PM Christian Grün <christian.gr...@gmail.com> wrote:
>>>>
>>>>> Exactly, it seems to be the final MERGE step during index creation that blows up your system. If you are restricted to the 2 GB of main memory, this is what you could try next:
>>>>>
>>>>> 1. Did you already try to tweak the JVM memory limit via -Xmx? What's the largest value that you can assign on your system?
>>>>>
>>>>> 2. If you will query only specific values of your data sets, you can restrict your indexes to specific elements or attributes; this will reduce memory consumption (see [1] for details). If you observe that no indexes will be utilized in your queries anyway, you can simply disable the text and attribute indexes, and memory usage will shrink even more.
>>>>>
>>>>> 3. Create your database on a more powerful system [2] and move it to your target machine (this only makes sense if there's no need for further updates).
>>>>>
>>>>> 4. Distribute your data across multiple databases. In some way, this is comparable to sharding; it cannot be automated, though, as the partitioning strategy depends on the characteristics of your XML input data (some people have huge standalone documents, others have millions of small documents, …).
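As a sketch of what option 4 looks like at query time: documents distributed over several databases can still be addressed in a single XQuery expression. Assuming hypothetical shard names lq_shard_1 … lq_shard_11, the following would count all documents across the shards:

  sum(
    for $i in 1 to 11
    return count(db:open("lq_shard_" || $i))
  )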
>>>>> [1] http://docs.basex.org/wiki/Indexes
>>>>> [2] A single CREATE call may be sufficient: CREATE DB database sample-data-for-basex-mailing-list-linuxquestions.org.tar.gz
>>>>>
>>>>> On Thu, Oct 3, 2019 at 8:53 AM first name last name <randomcod...@gmail.com> wrote:
>>>>>
>>>>> > I tried again, using SPLITSIZE = 12 in the .basex config file.
>>>>> > The batch (console) script I used is attached as mass-import.xq.
>>>>> > This time I didn't do the optimize or index creation post-import; instead, I did it as part of the import, similar to what is described in [4].
>>>>> > This time I got a different error, namely "org.basex.core.BaseXException: Out of Main Memory."
>>>>> > So right now I'm a bit out of ideas. Would AUTOOPTIMIZE make any difference here?
>>>>> >
>>>>> > Thanks
>>>>> >
>>>>> > [4] http://docs.basex.org/wiki/Indexes#Performance
>>>>> >
>>>>> > On Wed, Oct 2, 2019 at 11:06 AM first name last name <randomcod...@gmail.com> wrote:
>>>>> >
>>>>> >> Hey Christian,
>>>>> >>
>>>>> >> Thank you for your answer :)
>>>>> >> I tried setting SPLITSIZE = 24000 in .basex, but I've seen the same OOM behavior. It looks like the memory consumption is moderate until it reaches about 30GB (the size of the db before optimize), and then memory consumption spikes and OOM occurs. Now I'm trying with SPLITSIZE = 1000 and will report back if I get OOM again.
>>>>> >> Regarding what you said, it might be that the merge step is where the OOM occurs (I wonder if there's any way to control how much memory is being used inside the merge step).
>>>>> >>
>>>>> >> To quote the statistics page from the wiki:
>>>>> >> "Databases in BaseX are light-weight. If a database limit is reached, you can distribute your documents across multiple database instances and access all of them with a single XQuery expression."
>>>>> >> This to me sounds like sharding. I would probably be able to split the documents into chunks and upload them into dbs with the same prefix but a varying suffix, which seems a lot like shards. By doing this I think I can avoid OOM, but if BaseX provides other, better, maybe native mechanisms for avoiding OOM, I would try them.
>>>>> >>
>>>>> >> Best regards,
>>>>> >> Stefan
>>>>> >>
>>>>> >> On Tue, Oct 1, 2019 at 4:22 PM Christian Grün <christian.gr...@gmail.com> wrote:
>>>>> >>
>>>>> >>> Hi first name,
>>>>> >>>
>>>>> >>> If you optimize your database, the indexes will be rebuilt. In this step, the builder tries to guess how much free memory is still available. If memory is exhausted, parts of the index will be split (i.e., partially written to disk) and merged in a final step. However, you can circumvent the heuristics by manually assigning a static split value; see [1] for more information. If you use the DBA, you'll need to assign this value in your .basex or web.xml file [2]. In order to find the best value for your setup, it may be easier to play around with the BaseX GUI.
>>>>> >>>
>>>>> >>> As you have already seen in our statistics, an XML document has various properties that may represent a limit for a single database. Accordingly, these properties make it difficult for the system to decide when memory will be exhausted during an import or index rebuild.
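For reference, pinning the split value instead of relying on the heuristics is a one-line setting, e.g. in the .basex file (the value below is just a starting point to experiment with):

  SPLITSIZE = 1000

or, interactively, SET SPLITSIZE 1000 before running the import or optimize.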
>>>>> >>> In general, you'll get the best performance (and your memory consumption will be lower) if you create your database and specify the data to be imported in a single run. This is currently not possible via the DBA; use the GUI (Create Database) or console mode (CREATE DB command) instead.
>>>>> >>>
>>>>> >>> Hope this helps,
>>>>> >>> Christian
>>>>> >>>
>>>>> >>> [1] http://docs.basex.org/wiki/Options#SPLITSIZE
>>>>> >>> [2] http://docs.basex.org/wiki/Configuration
>>>>> >>>
>>>>> >>> On Mon, Sep 30, 2019 at 7:09 AM first name last name <randomcod...@gmail.com> wrote:
>>>>> >>>
>>>>> >>> > Hi,
>>>>> >>> >
>>>>> >>> > Let's say there's a 30GB dataset [3] containing most threads/posts from [1].
>>>>> >>> > After importing all of it, when I try to run /dba/db-optimize/ on it (which must have some corresponding command), I get the OOM error in the stack trace attached. I am using -Xmx2g, so BaseX is limited to 2GB of memory (the machine I'm running this on doesn't have a lot of memory).
>>>>> >>> > I was looking at [2] for some estimates of peak memory usage for this db-optimize operation, but couldn't find any.
>>>>> >>> > Actually, it would be nice to know peak memory usage because, of course, for any database (including BaseX) a common task is server sizing, i.e. knowing what kind of server would be needed.
>>>>> >>> > In this case, it seems like 2GB of memory is enough to import 340k documents weighing in at 30GB total, but it's not enough to run db-optimize.
>>>>> >>> > Is there any info about peak memory usage on [2]? And are there guidelines for large-scale collection imports like the one I'm trying to do?
>>>>> >>> >
>>>>> >>> > Thanks,
>>>>> >>> > Stefan
>>>>> >>> >
>>>>> >>> > [1] https://www.linuxquestions.org/
>>>>> >>> > [2] http://docs.basex.org/wiki/Statistics
>>>>> >>> > [3] https://drive.google.com/open?id=1lTEGA4JqlhVf1JsMQbloNGC-tfNkeQt2
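Tying this back to the -Xmx discussion above, a minimal sketch of such a single-run import with a larger heap, assuming the standard Linux start scripts (which pass the BASEX_JVM environment variable to the JVM); the database name, path, and heap size are placeholders:

  export BASEX_JVM="-Xmx3100m"
  basex -c "CREATE DB db_name /path/to/xml-directory"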