About option 1: How much memory have you been able to assign to the Java VM?
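In case it helps to experiment: one way to probe the limit is to pass -Xmx directly to the JVM that runs BaseX and see where allocation starts to fail. A minimal sketch (the jar path and the 3g value are only examples for a 4 GB machine; adjust them to your installation):

    # Sketch: raise the JVM heap for a standalone BaseX run.
    java -Xmx3g -cp BaseX.jar org.basex.BaseX -c"INFO"

    # The launcher scripts of recent distributions also read a BASEX_JVM
    # variable (check your bin/basex script), so the same limit can be set as:
    BASEX_JVM="-Xmx3g" basex -c"INFO"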
first name last name <randomcod...@gmail.com> wrote on Sat, Oct 5, 2019, 01:11:

> I had another look at the script I wrote and realized that it's not working as it's supposed to.
> Apparently the order of operations should be this:
> - turn on all the types of indexes required
> - create the db
> - apply the parser settings and the filter settings
> - add all the files to the db
> - run "OPTIMIZE"
>
> If I don't do them in this order (specifically with "OPTIMIZE" at the end), the resulting db lacks all indexes.
>
> On Fri, Oct 4, 2019 at 11:32 PM first name last name <randomcod...@gmail.com> wrote:
>
>> Hi Christian,
>>
>> About option 4:
>> I agree with the options you laid out. I am currently diving deeper into option 4 in the list you wrote.
>> Regarding the partitioning strategy, I agree. I did manage, however, to partition the files to be imported into separate sets, with a constraint on max partition size (on disk) and max partition file count (the number of XML documents in each partition).
>> The tool called fpart [5] made this possible (I can imagine more sophisticated bin-packing methods, involving pre-computed node count values and other variables, could be achieved via glpk [6], but that might be too much work).
>> So, currently I am experimenting with a max partition size of 2.4GB and a max file count of 85k files; fpart has split the file list into 11 partitions of ~33k files each, with each partition weighing in at ~2.4GB.
>> So, I wrote a script for this; it's called sharded-import.sh and is attached here. I'm also noticing that the /dba/ BaseX web interface is no longer blocked while this script runs, as opposed to the previous import, where I ran
>> CREATE DB db_name /directory/
>> so now I can see the progress and run queries before the big import finishes.
>> Maybe the downside is that it's more verbose and prints out a ton of lines like
>> > ADD /share/Public/archive/tech-sites/linuxquestions.org/threads/viewtopic_9_356613.html
>> Resource(s) added in 47.76 ms.
>> along the way, and maybe that's slower than before.
>>
>> About option 1:
>> Re: increasing memory, I am running these experiments on a low-memory, old network-attached storage box, model QNAP TS-451+ [7] [8], which I had to take apart with a screwdriver to add 2GB of RAM (now it has 4GB of memory), and I can't seem to find any additional memory sticks around the house to take it up to 8GB (which is also the maximum memory it supports). And if I want to find 2 x 4GB sticks of RAM, the frequency of the memory has to match what the NAS supports, and I'm having trouble finding the exact one; Corsair says it has memory sticks that would work, but I'd have to wait weeks for them to ship to Bucharest, which is where I live.
>> It seems like buying an Intel NUC that goes up to 64GB of memory would be a bit too expensive at $1639 [9], but .. people on reddit [10] were discussing, some years back, this supermicro server [11], which is only $668 and would allow up to 64GB of memory.
>> Basically I would buy something cheap that I can jampack with a lot of RAM, but a hands-off approach would be best here, so something that comes pre-equipped with all the memory would be nice (it would spare me the trouble of having to buy the memory separately and making sure it matches the motherboard specs etc.).
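A minimal sketch of the ordering and per-partition import described above (this is not the attached sharded-import.sh; the partition list names part.1 through part.11, the database names, and the SET lines are illustrative assumptions):

    #!/bin/sh
    # Illustrative sketch: one database per fpart partition list (part.1 ... part.11
    # are assumed names). Index options are set before CREATE DB, parser/filter
    # settings before ADD, and OPTIMIZE runs last so the indexes are actually built.
    for i in $(seq 1 11); do
      {
        echo "SET TEXTINDEX true"
        echo "SET ATTRINDEX true"
        echo "CREATE DB lq_part_$i"
        echo "SET PARSER html"
        echo "SET CREATEFILTER *.html"
        while read -r f; do
          echo "ADD $f"
        done < "part.$i"
        echo "OPTIMIZE"
      } | basex
    done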
>> About option 2:
>> In fact, that's a great idea. But it would require me to write something that would figure out the XPath patterns where the actual content sits. I actually wanted to look for some algorithm that's designed to do that, and try to implement it, but I had no time.
>> It would have to detect the repetitive, bloated nodes and build XPaths for the rest of the nodes, where the actual content sits. I think this would be equivalent to computing the "web template" of a website, given all its pages.
>> It would definitely decrease the size of the content that would have to be indexed.
>> By the way, here I'm writing about a more general procedure, because it's not just this dataset that I want to import.. I want to import heavy, large amounts of data :)
>>
>> These are my thoughts for now
>>
>> [5] https://github.com/martymac/fpart
>> [6] https://www.gnu.org/software/glpk/
>> [7] https://www.amazon.com/dp/B015VNLGF8
>> [8] https://www.qnap.com/en/product/ts-451+
>> [9] https://www.amazon.com/Intel-NUC-NUC8I7HNK-Gaming-Mini/dp/B07WGWWSWT/
>> [10] https://www.reddit.com/r/sysadmin/comments/64x2sb/nuc_like_system_but_with_64gb_ram/
>> [11] https://www.amazon.com/Supermicro-SuperServer-E300-8D-Mini-1U-D-1518/dp/B01M0VTV3E
>>
>> On Thu, Oct 3, 2019 at 1:30 PM Christian Grün <christian.gr...@gmail.com> wrote:
>>
>>> Exactly, it seems to be the final MERGE step during index creation that blows up your system. If you are restricted to the 2 GB of main memory, this is what you could try next:
>>>
>>> 1. Did you already try to tweak the JVM memory limit via -Xmx? What's the largest value that you can assign on your system?
>>>
>>> 2. If you will query only specific values of your data sets, you can restrict your indexes to specific elements or attributes; this will reduce memory consumption (see [1] for details). If you observe that no indexes will be utilized in your queries anyway, you can simply disable the text and attribute indexes, and memory usage will shrink even more.
>>>
>>> 3. Create your database on a more powerful system [2] and move it to your target machine (this only makes sense if there's no need for further updates).
>>>
>>> 4. Distribute your data across multiple databases. In some way, this is comparable to sharding; it cannot be automated, though, as the partitioning strategy depends on the characteristics of your XML input data (some people have huge standalone documents, others have millions of small documents, …).
>>>
>>> [1] http://docs.basex.org/wiki/Indexes
>>> [2] A single CREATE call may be sufficient: CREATE DB database sample-data-for-basex-mailing-list-linuxquestions.org.tar.gz
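For point 2 above, a minimal sketch of what restricting the value indexes at database creation time could look like; the element names post and title are placeholders, not names taken from the actual linuxquestions.org markup:

    # Sketch only: index the texts of selected elements and skip the attribute
    # and full-text indexes; "post" and "title" are placeholder element names.
    {
      echo "SET TEXTINDEX true"
      echo "SET TEXTINCLUDE post title"
      echo "SET ATTRINDEX false"
      echo "SET FTINDEX false"
      echo "CREATE DB lq_restricted /share/Public/archive/tech-sites/linuxquestions.org"
    } | basex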
>>> On Thu, Oct 3, 2019 at 8:53 AM first name last name <randomcod...@gmail.com> wrote:
>>>
>>> > I tried again, using SPLITSIZE = 12 in the .basex config file.
>>> > The batch (console) script I used is attached as mass-import.xq.
>>> > This time I didn't do the optimize or index creation post-import; instead, I did it as part of the import, similar to what is described in [4].
>>> > This time I got a different error, namely "org.basex.core.BaseXException: Out of Main Memory."
>>> > So right now.. I'm a bit out of ideas. Would AUTOOPTIMIZE make any difference here?
>>> >
>>> > Thanks
>>> >
>>> > [4] http://docs.basex.org/wiki/Indexes#Performance
>>> >
>>> > On Wed, Oct 2, 2019 at 11:06 AM first name last name <randomcod...@gmail.com> wrote:
>>> >
>>> >> Hey Christian,
>>> >>
>>> >> Thank you for your answer :)
>>> >> I tried setting SPLITSIZE = 24000 in .basex, but I've seen the same OOM behavior. It looks like the memory consumption is moderate until the db reaches about 30GB (the size of the db before optimize), and then memory consumption spikes and OOM occurs. Now I'm trying with SPLITSIZE = 1000 and will report back if I get OOM again.
>>> >> Regarding what you said, it might be that the merge step is where the OOM occurs (I wonder if there's any way to control how much memory is being used inside the merge step).
>>> >>
>>> >> To quote the statistics page from the wiki: "Databases in BaseX are light-weight. If a database limit is reached, you can distribute your documents across multiple database instances and access all of them with a single XQuery expression."
>>> >> This to me sounds like sharding. I would probably be able to split the documents into chunks and upload them into dbs with the same prefix but varying suffixes.. which seems a lot like shards. By doing this I think I can avoid OOM, but if BaseX provides other, better, maybe native mechanisms of avoiding OOM, I would try them.
>>> >>
>>> >> Best regards,
>>> >> Stefan
>>> >>
>>> >> On Tue, Oct 1, 2019 at 4:22 PM Christian Grün <christian.gr...@gmail.com> wrote:
>>> >>
>>> >>> Hi first name,
>>> >>>
>>> >>> If you optimize your database, the indexes will be rebuilt. In this step, the builder tries to guess how much free memory is still available. If memory is exhausted, parts of the index will be split (i.e., partially written to disk) and merged in a final step. However, you can circumvent the heuristics by manually assigning a static split value; see [1] for more information. If you use the DBA, you'll need to assign this value in your .basex or web.xml file [2]. In order to find the best value for your setup, it may be easier to play around with the BaseX GUI.
>>> >>>
>>> >>> As you have already seen in our statistics, an XML document has various properties that may represent a limit for a single database. Accordingly, these properties make it difficult for the system to decide when memory will be exhausted during an import or index rebuild.
>>> >>>
>>> >>> In general, you'll get the best performance (and your memory consumption will be lower) if you create your database and specify the data to be imported in a single run. This is currently not possible via the DBA; use the GUI (Create Database) or console mode (CREATE DB command) instead.
>>> >>>
>>> >>> Hope this helps,
>>> >>> Christian
>>> >>>
>>> >>> [1] http://docs.basex.org/wiki/Options#SPLITSIZE
>>> >>> [2] http://docs.basex.org/wiki/Configuration
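As a follow-up to the "single XQuery expression" remark above: if the data ends up spread over several databases (say lq_part_1 through lq_part_11, names carried over from the earlier sketch and purely illustrative), one query can still address all of them. A minimal sketch:

    # Count all documents across the assumed shards lq_part_1 ... lq_part_11
    # in one expression; db:open returns the documents stored in a database.
    basex -q'sum(for $i in 1 to 11 return count(db:open("lq_part_" || $i)))'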
>>> >>> On Mon, Sep 30, 2019 at 7:09 AM first name last name <randomcod...@gmail.com> wrote:
>>> >>>
>>> >>> > Hi,
>>> >>> >
>>> >>> > Let's say there's a 30GB dataset [3] containing most threads/posts from [1].
>>> >>> > After importing all of it, when I try to run /dba/db-optimize/ on it (which must have some corresponding command), I get the OOM error in the stacktrace attached. I am using -Xmx2g, so BaseX is limited to 2GB of memory (the machine I'm running this on doesn't have a lot of memory).
>>> >>> > I was looking at [2] for some estimates of peak memory usage for this "db-optimize" operation, but couldn't find any.
>>> >>> > Actually, it would be nice to know the peak memory usage because, of course, for any database (including BaseX) a common task is server sizing, i.e. knowing what kind of server would be needed.
>>> >>> > In this case, it seems like 2GB of memory is enough to import 340k documents, weighing in at 30GB total, but it's not enough to run "db-optimize".
>>> >>> > Is there any info about peak memory usage on [2]? And are there guidelines for large-scale collection imports like the one I'm trying to do?
>>> >>> >
>>> >>> > Thanks,
>>> >>> > Stefan
>>> >>> >
>>> >>> > [1] https://www.linuxquestions.org/
>>> >>> > [2] http://docs.basex.org/wiki/Statistics
>>> >>> > [3] https://drive.google.com/open?id=1lTEGA4JqlhVf1JsMQbloNGC-tfNkeQt2