The stack trace indicates that you enabled the full-text index as well. For this index, you definitely need more memory than is available on your system.
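If full-text queries are not actually needed for this data, a minimal, untested sketch for dropping that index and disabling the option so it is not requested again would be (db_name is a placeholder):

  OPEN db_name
  SET FTINDEX false
  DROP INDEX fulltext

That should relieve much of the memory pressure when the remaining index structures are optimized.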
So I assume you didn't encounter trouble with the default index structures?

first name last name <randomcod...@gmail.com> wrote on Sat, Oct 5, 2019, 20:52:

> Yes, I did, with -Xmx3100m (that's the maximum amount of memory I can allocate on that system for BaseX) and I got OOM.
>
> On Sat, Oct 5, 2019 at 2:19 AM Christian Grün <christian.gr...@gmail.com> wrote:
>
>> About option 1: How much memory have you been able to assign to the Java VM?
>>
>> first name last name <randomcod...@gmail.com> wrote on Sat, Oct 5, 2019, 01:11:
>>
>>> I had another look at the script I wrote and realized that it's not working as it's supposed to.
>>> Apparently the order of operations should be this:
>>> - turn on all the types of indexes required
>>> - create the db
>>> - apply the parser settings and the filter settings
>>> - add all the files to the db
>>> - run "OPTIMIZE"
>>>
>>> If I don't do them in this order (specifically with "OPTIMIZE" at the end), the resulting db lacks all indexes.
>>>
>>> On Fri, Oct 4, 2019 at 11:32 PM first name last name <randomcod...@gmail.com> wrote:
>>>
>>>> Hi Christian,
>>>>
>>>> About option 4:
>>>> I agree with the options you laid out. I am currently diving deeper into option 4 in the list you wrote.
>>>> Regarding the partitioning strategy, I agree. I did manage, however, to partition the files to be imported into separate sets, with a constraint on max partition size (on disk) and max partition file count (the number of XML documents in each partition).
>>>> The tool called fpart [5] made this possible (I can imagine more sophisticated bin-packing methods, involving pre-computed node counts and other variables, could be achieved via glpk [6], but that might be too much work).
>>>> So, currently I am experimenting with a max partition size of 2.4GB and a max file count of 85k files, and fpart seems to have split the file list into 11 partitions of ~33k files each, with each partition weighing in at ~2.4GB.
>>>> So, I wrote a script for this; it's called sharded-import.sh and is attached here. I'm also noticing that the /dba/ BaseX web interface is no longer blocked while this script runs, as opposed to the previous import, where I ran
>>>> CREATE DB db_name /directory/
>>>> so I can now watch the progress or run queries before the big import finishes.
>>>> Maybe the downside is that it's more verbose and prints out a ton of lines like
>>>> > ADD /share/Public/archive/tech-sites/linuxquestions.org/threads/viewtopic_9_356613.html
>>>> Resource(s) added in 47.76 ms.
>>>> along the way, and maybe that's slower than before.
>>>>
>>>> About option 1:
>>>> Re: increasing memory, I am running these experiments on a low-memory, old network-attached storage device, a QNAP TS-451+ [7] [8], which I had to take apart with a screwdriver to add 2GB of RAM (it now has 4GB of memory), and I can't seem to find any additional memory sticks around the house to take it up to 8GB (which is also the maximum memory it supports). If I want to buy something like 2 x 4GB sticks of RAM, the frequency has to match what the NAS supports, and I'm having trouble finding the exact model; Corsair says it has memory sticks that would work, but I'd have to wait weeks for them to ship to Bucharest, which is where I live.
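As a rough sketch, the ordering described a few messages above would look like this as a BaseX command script (the database name, paths, filter, and parser settings are only illustrative; the input here is HTML, so a non-XML parser may be needed):

  SET TEXTINDEX true
  SET ATTRINDEX true
  CREATE DB db_name
  SET CREATEFILTER *.html
  SET PARSER html
  ADD /path/to/partition-01
  ADD /path/to/partition-02
  OPTIMIZE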
>>>> It seems like buying an Intel NUC that goes up to 64GB of memory would be a bit too expensive at $1639 [9], but people on reddit [10] were discussing, some years back, this Supermicro server [11], which is only $668 and would allow up to 64GB of memory.
>>>> Basically I would buy something cheap that I can jam-pack with a lot of RAM, but a hands-off approach would be best here, so something that comes pre-equipped with all the memory would be nice (it would spare me the trouble of having to buy the memory separately and make sure it matches the motherboard specs, etc.).
>>>>
>>>> About option 2:
>>>> In fact, that's a great idea. But it would require me to write something that figures out the XPath patterns where the actual content sits. I actually wanted to look for some algorithm that's designed to do that, and try to implement it, but I had no time.
>>>> It would have to detect the repetitive, bloated nodes and build XPaths for the rest of the nodes, where the actual content sits. I think this would be equivalent to computing the "web template" of a website, given all its pages.
>>>> It would definitely decrease the size of the content that would have to be indexed.
>>>> By the way, here I'm writing about a more general procedure, because it's not just this dataset that I want to import; I want to import heavy, large amounts of data :)
>>>>
>>>> These are my thoughts for now.
>>>>
>>>> [5] https://github.com/martymac/fpart
>>>> [6] https://www.gnu.org/software/glpk/
>>>> [7] https://www.amazon.com/dp/B015VNLGF8
>>>> [8] https://www.qnap.com/en/product/ts-451+
>>>> [9] https://www.amazon.com/Intel-NUC-NUC8I7HNK-Gaming-Mini/dp/B07WGWWSWT/
>>>> [10] https://www.reddit.com/r/sysadmin/comments/64x2sb/nuc_like_system_but_with_64gb_ram/
>>>> [11] https://www.amazon.com/Supermicro-SuperServer-E300-8D-Mini-1U-D-1518/dp/B01M0VTV3E
>>>>
>>>> On Thu, Oct 3, 2019 at 1:30 PM Christian Grün <christian.gr...@gmail.com> wrote:
>>>>
>>>>> Exactly, it seems to be the final MERGE step during index creation that blows up your system. If you are restricted to the 2 GB of main memory, this is what you could try next:
>>>>>
>>>>> 1. Did you already try to tweak the JVM memory limit via -Xmx? What's the largest value that you can assign on your system?
>>>>>
>>>>> 2. If you will query only specific values of your data sets, you can restrict your indexes to specific elements or attributes; this will reduce memory consumption (see [1] for details). If you observe that no indexes will be utilized in your queries anyway, you can simply disable the text and attribute indexes, and memory usage will shrink even more.
>>>>>
>>>>> 3. Create your database on a more powerful system [2] and move it to your target machine (this only makes sense if there's no need for further updates).
>>>>>
>>>>> 4. Distribute your data across multiple databases. In some way, this is comparable to sharding; it cannot be automated, though, as the partitioning strategy depends on the characteristics of your XML input data (some people have huge standalone documents, others have millions of small documents, …).
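As a sketch of what option 4 looks like at query time: documents distributed over several databases can still be addressed in a single XQuery expression. Assuming hypothetical shard names lq_shard_1 … lq_shard_11, the following would count all documents across the shards:

  sum(
    for $i in 1 to 11
    return count(db:open("lq_shard_" || $i))
  )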
>>>>> [1] http://docs.basex.org/wiki/Indexes
>>>>> [2] A single CREATE call may be sufficient: CREATE DB database sample-data-for-basex-mailing-list-linuxquestions.org.tar.gz
>>>>>
>>>>> On Thu, Oct 3, 2019 at 8:53 AM first name last name <randomcod...@gmail.com> wrote:
>>>>>
>>>>> > I tried again, using SPLITSIZE = 12 in the .basex config file.
>>>>> > The batch (console) script I used is attached as mass-import.xq.
>>>>> > This time I didn't do the optimize or index creation post-import; instead, I did it as part of the import, similar to what is described in [4].
>>>>> > This time I got a different error, namely "org.basex.core.BaseXException: Out of Main Memory."
>>>>> > So right now I'm a bit out of ideas. Would AUTOOPTIMIZE make any difference here?
>>>>> >
>>>>> > Thanks
>>>>> >
>>>>> > [4] http://docs.basex.org/wiki/Indexes#Performance
>>>>> >
>>>>> > On Wed, Oct 2, 2019 at 11:06 AM first name last name <randomcod...@gmail.com> wrote:
>>>>> >
>>>>> >> Hey Christian,
>>>>> >>
>>>>> >> Thank you for your answer :)
>>>>> >> I tried setting SPLITSIZE = 24000 in .basex, but I've seen the same OOM behavior. It looks like the memory consumption is moderate until it reaches about 30GB (the size of the db before optimize), and then memory consumption spikes and OOM occurs. Now I'm trying with SPLITSIZE = 1000 and will report back if I get OOM again.
>>>>> >> Regarding what you said, it might be that the merge step is where the OOM occurs (I wonder if there's any way to control how much memory is being used inside the merge step).
>>>>> >>
>>>>> >> To quote the statistics page from the wiki:
>>>>> >> "Databases in BaseX are light-weight. If a database limit is reached, you can distribute your documents across multiple database instances and access all of them with a single XQuery expression."
>>>>> >> This to me sounds like sharding. I would probably be able to split the documents into chunks and upload them into dbs with the same prefix but a varying suffix, which seems a lot like shards. By doing this I think I can avoid OOM, but if BaseX provides other, better, maybe native mechanisms for avoiding OOM, I would try them.
>>>>> >>
>>>>> >> Best regards,
>>>>> >> Stefan
>>>>> >>
>>>>> >> On Tue, Oct 1, 2019 at 4:22 PM Christian Grün <christian.gr...@gmail.com> wrote:
>>>>> >>
>>>>> >>> Hi first name,
>>>>> >>>
>>>>> >>> If you optimize your database, the indexes will be rebuilt. In this step, the builder tries to guess how much free memory is still available. If memory is exhausted, parts of the index will be split (i.e., partially written to disk) and merged in a final step. However, you can circumvent the heuristics by manually assigning a static split value; see [1] for more information. If you use the DBA, you'll need to assign this value in your .basex or web.xml file [2]. In order to find the best value for your setup, it may be easier to play around with the BaseX GUI.
>>>>> >>>
>>>>> >>> As you have already seen in our statistics, an XML document has various properties that may represent a limit for a single database. Accordingly, these properties make it difficult for the system to decide when memory will be exhausted during an import or index rebuild.
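For reference, pinning the split value instead of relying on the heuristics is a one-line setting, e.g. in the .basex file (the value below is just a starting point to experiment with):

  SPLITSIZE = 1000

or, interactively, SET SPLITSIZE 1000 before running the import or optimize.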
>>>>> >>> In general, you'll get the best performance (and your memory consumption will be lower) if you create your database and specify the data to be imported in a single run. This is currently not possible via the DBA; use the GUI (Create Database) or console mode (CREATE DB command) instead.
>>>>> >>>
>>>>> >>> Hope this helps,
>>>>> >>> Christian
>>>>> >>>
>>>>> >>> [1] http://docs.basex.org/wiki/Options#SPLITSIZE
>>>>> >>> [2] http://docs.basex.org/wiki/Configuration
>>>>> >>>
>>>>> >>> On Mon, Sep 30, 2019 at 7:09 AM first name last name <randomcod...@gmail.com> wrote:
>>>>> >>>
>>>>> >>> > Hi,
>>>>> >>> >
>>>>> >>> > Let's say there's a 30GB dataset [3] containing most threads/posts from [1].
>>>>> >>> > After importing all of it, when I try to run /dba/db-optimize/ on it (which must have some corresponding command), I get the OOM error in the stack trace attached. I am using -Xmx2g, so BaseX is limited to 2GB of memory (the machine I'm running this on doesn't have a lot of memory).
>>>>> >>> > I was looking at [2] for some estimates of peak memory usage for this db-optimize operation, but couldn't find any.
>>>>> >>> > Actually, it would be nice to know peak memory usage because, of course, for any database (including BaseX) a common task is server sizing, i.e. knowing what kind of server would be needed.
>>>>> >>> > In this case, it seems like 2GB of memory is enough to import 340k documents weighing in at 30GB total, but it's not enough to run db-optimize.
>>>>> >>> > Is there any info about peak memory usage on [2]? And are there guidelines for large-scale collection imports like the one I'm trying to do?
>>>>> >>> >
>>>>> >>> > Thanks,
>>>>> >>> > Stefan
>>>>> >>> >
>>>>> >>> > [1] https://www.linuxquestions.org/
>>>>> >>> > [2] http://docs.basex.org/wiki/Statistics
>>>>> >>> > [3] https://drive.google.com/open?id=1lTEGA4JqlhVf1JsMQbloNGC-tfNkeQt2
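Tying this back to the -Xmx discussion above, a minimal sketch of such a single-run import with a larger heap, assuming the standard Linux start scripts (which pass the BASEX_JVM environment variable to the JVM); the database name, path, and heap size are placeholders:

  export BASEX_JVM="-Xmx3100m"
  basex -c "CREATE DB db_name /path/to/xml-directory"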