About option 1: How much memory have you been able to assign to the Java VM?
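In case it helps to experiment: one way to probe the limit is to pass -Xmx directly to the JVM that runs BaseX and see where allocation starts to fail. A minimal sketch (the jar path and the 3g value are only examples for a 4 GB machine; adjust them to your installation):

    # Sketch: raise the JVM heap for a standalone BaseX run.
    java -Xmx3g -cp BaseX.jar org.basex.BaseX -c"INFO"

    # The launcher scripts of recent distributions also read a BASEX_JVM
    # variable (check your bin/basex script), so the same limit can be set as:
    BASEX_JVM="-Xmx3g" basex -c"INFO"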
first name last name <randomcod...@gmail.com> wrote on Sat, Oct 5, 2019, 01:11:

> I had another look at the script I wrote and realized that it's not working as it's supposed to.
> Apparently the order of operations should be this:
> - turn on all the types of indexes required
> - create the db
> - apply the parser settings and the filter settings
> - add all the files to the db
> - run "OPTIMIZE"
>
> If I don't do them in this order (specifically with "OPTIMIZE" at the end), the resulting db lacks all indexes.
>
> On Fri, Oct 4, 2019 at 11:32 PM first name last name <randomcod...@gmail.com> wrote:
>
>> Hi Christian,
>>
>> About option 4:
>> I agree with the options you laid out. I am currently diving deeper into option 4 in the list you wrote.
>> Regarding the partitioning strategy, I agree. I did manage, however, to partition the files to be imported into separate sets, with a constraint on max partition size (on disk) and max partition file count (the number of XML documents in each partition).
>> The tool called fpart [5] made this possible (I can imagine more sophisticated bin-packing methods, involving pre-computed node count values and other variables, could be achieved via glpk [6], but that might be too much work).
>> So, currently I am experimenting with a max partition size of 2.4GB and a max file count of 85k files; fpart has split the file list into 11 partitions of ~33k files each, with each partition weighing in at ~2.4GB.
>> So, I wrote a script for this; it's called sharded-import.sh and is attached here. I'm also noticing that the /dba/ BaseX web interface is no longer blocked while this script runs, as opposed to the previous import, where I ran
>> CREATE DB db_name /directory/
>> so now I can see the progress and run queries before the big import finishes.
>> Maybe the downside is that it's more verbose and prints out a ton of lines like
>> > ADD /share/Public/archive/tech-sites/linuxquestions.org/threads/viewtopic_9_356613.html
>> Resource(s) added in 47.76 ms.
>> along the way, and maybe that's slower than before.
>>
>> About option 1:
>> Re: increasing memory, I am running these experiments on a low-memory, old network-attached storage box, model QNAP TS-451+ [7] [8], which I had to take apart with a screwdriver to add 2GB of RAM (now it has 4GB of memory), and I can't seem to find any additional memory sticks around the house to take it up to 8GB (which is also the maximum memory it supports). And if I want to find 2 x 4GB sticks of RAM, the frequency of the memory has to match what the NAS supports, and I'm having trouble finding the exact one; Corsair says it has memory sticks that would work, but I'd have to wait weeks for them to ship to Bucharest, which is where I live.
>> It seems like buying an Intel NUC that goes up to 64GB of memory would be a bit too expensive at $1639 [9], but .. people on reddit [10] were discussing, some years back, this supermicro server [11], which is only $668 and would allow up to 64GB of memory.
>> Basically I would buy something cheap that I can jampack with a lot of RAM, but a hands-off approach would be best here, so something that comes pre-equipped with all the memory would be nice (it would spare me the trouble of having to buy the memory separately and making sure it matches the motherboard specs etc.).
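A minimal sketch of the ordering and per-partition import described above (this is not the attached sharded-import.sh; the partition list names part.1 through part.11, the database names, and the SET lines are illustrative assumptions):

    #!/bin/sh
    # Illustrative sketch: one database per fpart partition list (part.1 ... part.11
    # are assumed names). Index options are set before CREATE DB, parser/filter
    # settings before ADD, and OPTIMIZE runs last so the indexes are actually built.
    for i in $(seq 1 11); do
      {
        echo "SET TEXTINDEX true"
        echo "SET ATTRINDEX true"
        echo "CREATE DB lq_part_$i"
        echo "SET PARSER html"
        echo "SET CREATEFILTER *.html"
        while read -r f; do
          echo "ADD $f"
        done < "part.$i"
        echo "OPTIMIZE"
      } | basex
    done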
>> About option 2:
>> In fact, that's a great idea. But it would require me to write something that would figure out the XPath patterns where the actual content sits. I actually wanted to look for some algorithm that's designed to do that, and try to implement it, but I had no time.
>> It would have to detect the repetitive, bloated nodes and build XPaths for the rest of the nodes, where the actual content sits. I think this would be equivalent to computing the "web template" of a website, given all its pages.
>> It would definitely decrease the size of the content that would have to be indexed.
>> By the way, here I'm writing about a more general procedure, because it's not just this dataset that I want to import.. I want to import heavy, large amounts of data :)
>>
>> These are my thoughts for now
>>
>> [5] https://github.com/martymac/fpart
>> [6] https://www.gnu.org/software/glpk/
>> [7] https://www.amazon.com/dp/B015VNLGF8
>> [8] https://www.qnap.com/en/product/ts-451+
>> [9] https://www.amazon.com/Intel-NUC-NUC8I7HNK-Gaming-Mini/dp/B07WGWWSWT/
>> [10] https://www.reddit.com/r/sysadmin/comments/64x2sb/nuc_like_system_but_with_64gb_ram/
>> [11] https://www.amazon.com/Supermicro-SuperServer-E300-8D-Mini-1U-D-1518/dp/B01M0VTV3E
>>
>> On Thu, Oct 3, 2019 at 1:30 PM Christian Grün <christian.gr...@gmail.com> wrote:
>>
>>> Exactly, it seems to be the final MERGE step during index creation that blows up your system. If you are restricted to the 2 GB of main memory, this is what you could try next:
>>>
>>> 1. Did you already try to tweak the JVM memory limit via -Xmx? What's the largest value that you can assign on your system?
>>>
>>> 2. If you will query only specific values of your data sets, you can restrict your indexes to specific elements or attributes; this will reduce memory consumption (see [1] for details). If you observe that no indexes will be utilized in your queries anyway, you can simply disable the text and attribute indexes, and memory usage will shrink even more.
>>>
>>> 3. Create your database on a more powerful system [2] and move it to your target machine (this only makes sense if there's no need for further updates).
>>>
>>> 4. Distribute your data across multiple databases. In some way, this is comparable to sharding; it cannot be automated, though, as the partitioning strategy depends on the characteristics of your XML input data (some people have huge standalone documents, others have millions of small documents, …).
>>>
>>> [1] http://docs.basex.org/wiki/Indexes
>>> [2] A single CREATE call may be sufficient: CREATE DB database sample-data-for-basex-mailing-list-linuxquestions.org.tar.gz
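For point 2 above, a minimal sketch of what restricting the value indexes at database creation time could look like; the element names post and title are placeholders, not names taken from the actual linuxquestions.org markup:

    # Sketch only: index the texts of selected elements and skip the attribute
    # and full-text indexes; "post" and "title" are placeholder element names.
    {
      echo "SET TEXTINDEX true"
      echo "SET TEXTINCLUDE post title"
      echo "SET ATTRINDEX false"
      echo "SET FTINDEX false"
      echo "CREATE DB lq_restricted /share/Public/archive/tech-sites/linuxquestions.org"
    } | basex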
>>> On Thu, Oct 3, 2019 at 8:53 AM first name last name <randomcod...@gmail.com> wrote:
>>>
>>> > I tried again, using SPLITSIZE = 12 in the .basex config file.
>>> > The batch (console) script I used is attached as mass-import.xq.
>>> > This time I didn't do the optimize or index creation post-import; instead, I did it as part of the import, similar to what is described in [4].
>>> > This time I got a different error, namely "org.basex.core.BaseXException: Out of Main Memory."
>>> > So right now.. I'm a bit out of ideas. Would AUTOOPTIMIZE make any difference here?
>>> >
>>> > Thanks
>>> >
>>> > [4] http://docs.basex.org/wiki/Indexes#Performance
>>> >
>>> > On Wed, Oct 2, 2019 at 11:06 AM first name last name <randomcod...@gmail.com> wrote:
>>> >
>>> >> Hey Christian,
>>> >>
>>> >> Thank you for your answer :)
>>> >> I tried setting SPLITSIZE = 24000 in .basex, but I've seen the same OOM behavior. It looks like the memory consumption is moderate until the db reaches about 30GB (the size of the db before optimize), and then memory consumption spikes and OOM occurs. Now I'm trying with SPLITSIZE = 1000 and will report back if I get OOM again.
>>> >> Regarding what you said, it might be that the merge step is where the OOM occurs (I wonder if there's any way to control how much memory is being used inside the merge step).
>>> >>
>>> >> To quote the statistics page from the wiki: "Databases in BaseX are light-weight. If a database limit is reached, you can distribute your documents across multiple database instances and access all of them with a single XQuery expression."
>>> >> This to me sounds like sharding. I would probably be able to split the documents into chunks and upload them into dbs with the same prefix but varying suffixes.. which seems a lot like shards. By doing this I think I can avoid OOM, but if BaseX provides other, better, maybe native mechanisms of avoiding OOM, I would try them.
>>> >>
>>> >> Best regards,
>>> >> Stefan
>>> >>
>>> >> On Tue, Oct 1, 2019 at 4:22 PM Christian Grün <christian.gr...@gmail.com> wrote:
>>> >>
>>> >>> Hi first name,
>>> >>>
>>> >>> If you optimize your database, the indexes will be rebuilt. In this step, the builder tries to guess how much free memory is still available. If memory is exhausted, parts of the index will be split (i.e., partially written to disk) and merged in a final step. However, you can circumvent the heuristics by manually assigning a static split value; see [1] for more information. If you use the DBA, you'll need to assign this value in your .basex or web.xml file [2]. In order to find the best value for your setup, it may be easier to play around with the BaseX GUI.
>>> >>>
>>> >>> As you have already seen in our statistics, an XML document has various properties that may represent a limit for a single database. Accordingly, these properties make it difficult for the system to decide when memory will be exhausted during an import or index rebuild.
>>> >>>
>>> >>> In general, you'll get the best performance (and your memory consumption will be lower) if you create your database and specify the data to be imported in a single run. This is currently not possible via the DBA; use the GUI (Create Database) or console mode (CREATE DB command) instead.
>>> >>>
>>> >>> Hope this helps,
>>> >>> Christian
>>> >>>
>>> >>> [1] http://docs.basex.org/wiki/Options#SPLITSIZE
>>> >>> [2] http://docs.basex.org/wiki/Configuration
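As a follow-up to the "single XQuery expression" remark above: if the data ends up spread over several databases (say lq_part_1 through lq_part_11, names carried over from the earlier sketch and purely illustrative), one query can still address all of them. A minimal sketch:

    # Count all documents across the assumed shards lq_part_1 ... lq_part_11
    # in one expression; db:open returns the documents stored in a database.
    basex -q'sum(for $i in 1 to 11 return count(db:open("lq_part_" || $i)))'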
>>> >>> On Mon, Sep 30, 2019 at 7:09 AM first name last name <randomcod...@gmail.com> wrote:
>>> >>>
>>> >>> > Hi,
>>> >>> >
>>> >>> > Let's say there's a 30GB dataset [3] containing most threads/posts from [1].
>>> >>> > After importing all of it, when I try to run /dba/db-optimize/ on it (which must have some corresponding command), I get the OOM error in the stacktrace attached. I am using -Xmx2g, so BaseX is limited to 2GB of memory (the machine I'm running this on doesn't have a lot of memory).
>>> >>> > I was looking at [2] for some estimates of peak memory usage for this "db-optimize" operation, but couldn't find any.
>>> >>> > Actually, it would be nice to know the peak memory usage because, of course, for any database (including BaseX) a common task is server sizing, i.e. knowing what kind of server would be needed.
>>> >>> > In this case, it seems like 2GB of memory is enough to import 340k documents, weighing in at 30GB total, but it's not enough to run "db-optimize".
>>> >>> > Is there any info about peak memory usage on [2]? And are there guidelines for large-scale collection imports like the one I'm trying to do?
>>> >>> >
>>> >>> > Thanks,
>>> >>> > Stefan
>>> >>> >
>>> >>> > [1] https://www.linuxquestions.org/
>>> >>> > [2] http://docs.basex.org/wiki/Statistics
>>> >>> > [3] https://drive.google.com/open?id=1lTEGA4JqlhVf1JsMQbloNGC-tfNkeQt2