Regarding selective full-text indexing, I just tried

  XQUERY db:optimize("linuxquestions.org-selective", true(),
    map { 'ftindex': true(), 'ftinclude': 'div table td a' })

and I got an OOM on that as well; the exact stack trace is attached to
this message.

I will open a separate thread regarding migrating the data from BaseX
shards to PostgreSQL (for the purpose of full-text indexing).

On Sun, Oct 6, 2019 at 10:19 AM Christian Grün <christian.gr...@gmail.com>
wrote:

> The current full-text index builder provides an outsourcing mechanism
> similar to that of the index builder for the default index structures,
> but its metadata structures are kept in main memory, and they are
> bulkier. There are definitely ways to tackle this technically; it
> hasn't been a high priority so far, but this may change.
>
> Please note that you wouldn't create an index over your whole data set
> in an RDBMS either. Instead, you would usually create it for the
> specific fields that you will query later on. Being able to build an
> index over all of your data is a convenience feature of BaseX. For
> large full-text corpora, however, it is advisable in most cases to
> restrict indexing to the relevant XML elements.
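>
> For example, restricting the full-text index to the few elements that
> actually carry the post text could look like the following sketch (the
> element names are placeholders; pick the ones that matter in your
> documents):
>
>   db:optimize('linuxquestions.org', true(),
>     map { 'ftindex': true(), 'ftinclude': 'div,p' })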
>
>
>
>
> first name last name <randomcod...@gmail.com> wrote on Sat., Oct 5,
> 2019, 23:28:
>
>> Attached is a more complete output of ./bin/basexhttp . Judging from
>> this output, it seems that everything was OK except for the full-text
>> index.
>> I now realize that I have another question about full-text indexes: it
>> seems that building the full-text index depends on the amount of
>> memory available (in other words, the more data there is to index, the
>> more RAM is required).
>>
>> I was using a certain popular RDBMS for full-text indexing, and I
>> never ran into problems such as running out of memory when building
>> such indexes.
>> I think its model uses a fixed-size buffer in memory, keeps multiple
>> files on disk where it stores intermediate data, and then merges the
>> results in memory, always under the constraint of using only as much
>> memory as it was configured to use.
>> The relevant topic is probably "external memory algorithms" or
>> "full-text search using secondary storage".
>> I'm not an expert in this field, but my question here would be: is
>> this kind of thing something that BaseX is looking to handle in the
>> future?
>>
>> Thanks,
>> Stefan
>>
>>
>> On Sat, Oct 5, 2019 at 11:08 PM Christian Grün <christian.gr...@gmail.com>
>> wrote:
>>
>>> The stack trace indicates that you enabled the full-text index as
>>> well. For this index, you definitely need more memory than is
>>> available on your system.
>>>
>>> So I assume you didn't encounter trouble with the default index
>>> structures?
>>>
>>>
>>>
>>>
>>> first name last name <randomcod...@gmail.com> wrote on Sat., Oct 5,
>>> 2019, 20:52:
>>>
>>>> Yes, I did, with -Xmx3100m (that's the maximum amount of memory I can
>>>> allocate on that system for BaseX) and I got OOM.
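>>>>
>>>> For the record, I'm passing that limit through the start script's
>>>> environment; I believe the bundled scripts pass $BASEX_JVM on to the
>>>> JVM (please double-check your copy), so my setup is roughly:
>>>>
>>>>   # assumption: bin/basexhttp forwards $BASEX_JVM to java
>>>>   export BASEX_JVM="-Xmx3100m"
>>>>   ./bin/basexhttp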
>>>>
>>>> On Sat, Oct 5, 2019 at 2:19 AM Christian Grün <
>>>> christian.gr...@gmail.com> wrote:
>>>>
>>>>> About option 1: How much memory have you been able to assign to the
>>>>> Java VM?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> first name last name <randomcod...@gmail.com> wrote on Sat., Oct 5,
>>>>> 2019, 01:11:
>>>>>
>>>>>> I had another look at the script I wrote and realized that it's
>>>>>> not working as it's supposed to.
>>>>>> Apparently the order of operations should be this:
>>>>>> - turn on all the types of indexes required
>>>>>> - create the db
>>>>>> - set the parser settings and the filter settings
>>>>>> - add all the files to the db
>>>>>> - run "OPTIMIZE"
>>>>>>
>>>>>> If I don't do them in this order (specifically, with "OPTIMIZE" at
>>>>>> the end), the resulting db ends up without any indexes.
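>>>>>>
>>>>>> For reference, a minimal BaseX command script along those lines
>>>>>> would be something like the sketch below (the database name is
>>>>>> made up; the path is the one from my setup):
>>>>>>
>>>>>>   SET TEXTINDEX true
>>>>>>   SET ATTRINDEX true
>>>>>>   SET FTINDEX true
>>>>>>   SET PARSER html
>>>>>>   SET CREATEFILTER *.html
>>>>>>   CREATE DB linuxquestions-shard-01
>>>>>>   ADD /share/Public/archive/tech-sites/linuxquestions.org/threads
>>>>>>   OPTIMIZE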
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 4, 2019 at 11:32 PM first name last name <
>>>>>> randomcod...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> About option 4:
>>>>>>> I agree with the options you laid out. I am currently diving deeper
>>>>>>> into option 4 in the list you wrote.
>>>>>>> Regarding the partitioning strategy, I agree. I did, however,
>>>>>>> manage to partition the files to be imported into separate sets,
>>>>>>> with constraints on the maximum partition size (on disk) and the
>>>>>>> maximum file count per partition (the number of XML documents in
>>>>>>> each partition).
>>>>>>> The tool fpart [5] made this possible (I can imagine that more
>>>>>>> sophisticated bin-packing methods, involving pre-computed node
>>>>>>> counts and other variables, could be achieved via glpk [6], but
>>>>>>> that might be too much work).
>>>>>>> So, currently I am experimenting with a maximum partition size of
>>>>>>> 2.4GB and a maximum file count of 85k files; fpart split the file
>>>>>>> list into 11 partitions of about 33k files each, with each
>>>>>>> partition weighing in at roughly 2.4GB.
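>>>>>>>
>>>>>>> The invocation was along these lines (flag meanings quoted from
>>>>>>> memory of the fpart man page, so please double-check before
>>>>>>> reusing; the output prefix is arbitrary):
>>>>>>>
>>>>>>>   # -s: max partition size in bytes, -f: max files per partition,
>>>>>>>   # -o: prefix for the generated per-partition file lists
>>>>>>>   fpart -s $((2400*1024*1024)) -f 85000 -o partition \
>>>>>>>     /share/Public/archive/tech-sites/linuxquestions.org/threads
>>>>>>>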
>>>>>>> So I wrote a script for this; it's called sharded-import.sh and
>>>>>>> is attached here. I'm also noticing that the /dba/ BaseX web
>>>>>>> interface is no longer blocked while this script runs, as opposed
>>>>>>> to the previous import where I ran
>>>>>>>   CREATE DB db_name /directory/
>>>>>>> so now I can see the progress and run queries before the big
>>>>>>> import finishes.
>>>>>>> The downside may be that it's more verbose and prints a ton of
>>>>>>> lines like
>>>>>>>   > ADD /share/Public/archive/tech-sites/linuxquestions.org/threads/viewtopic_9_356613.html
>>>>>>>   Resource(s) added in 47.76 ms.
>>>>>>> along the way, and it may be slower than before.
>>>>>>>
>>>>>>> About option 1:
>>>>>>> Re: increasing memory, I am running these experiments on an old,
>>>>>>> low-memory network-attached storage device, a QNAP TS-451+ [7]
>>>>>>> [8], which I had to take apart with a screwdriver to add 2GB of
>>>>>>> RAM (it now has 4GB of memory), and I can't seem to find any
>>>>>>> additional memory sticks around the house to take it up to 8GB
>>>>>>> (which is also the maximum memory it supports). And if I want to
>>>>>>> buy 2 x 4GB sticks of RAM, the frequency has to match what the
>>>>>>> device supports, and I'm having trouble finding the exact type;
>>>>>>> Corsair says it has memory sticks that would work, but I'd have
>>>>>>> to wait weeks for them to ship to Bucharest, which is where I
>>>>>>> live.
>>>>>>> An Intel NUC that goes up to 64GB of memory seems a bit too
>>>>>>> expensive at $1639 [9], but some years back people on reddit [10]
>>>>>>> were discussing this Supermicro server [11], which is only $668
>>>>>>> and can take up to 64GB of memory.
>>>>>>> Basically I would buy something cheap that I can jam-pack with a
>>>>>>> lot of RAM, but a hands-off approach would be best here, so
>>>>>>> something pre-equipped with all the memory would be nice (it
>>>>>>> would spare me the trouble of buying the memory separately and
>>>>>>> making sure it matches the motherboard specs, etc.).
>>>>>>>
>>>>>>> About option 2:
>>>>>>> That's a great idea, in fact, but it would require me to write
>>>>>>> something that figures out the XPath patterns under which the
>>>>>>> actual content sits. I wanted to look for an algorithm designed
>>>>>>> to do exactly that and try to implement it, but I had no time.
>>>>>>> It would have to detect the repetitive, bloated template nodes
>>>>>>> and build XPaths for the remaining nodes, where the actual
>>>>>>> content sits. I think this is equivalent to computing the "web
>>>>>>> template" of a website, given all its pages.
>>>>>>> It would definitely decrease the amount of content that has to be
>>>>>>> indexed.
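>>>>>>>
>>>>>>> As a very rough first step, an XQuery like the sketch below
>>>>>>> (untested, database name made up, and probably expensive on a
>>>>>>> full shard) could rank element paths by how much text they carry,
>>>>>>> which should make the content-bearing paths stand out from the
>>>>>>> template:
>>>>>>>
>>>>>>>   (: rank element paths by the amount of text below them :)
>>>>>>>   for $t in db:open('linuxquestions-shard-01')//text()
>>>>>>>   let $path := '/' || string-join($t/ancestor::*/name(), '/')
>>>>>>>   group by $path
>>>>>>>   order by sum($t ! string-length()) descending
>>>>>>>   return <path chars="{ sum($t ! string-length()) }"
>>>>>>>     nodes="{ count($t) }">{ $path }</path>
>>>>>>>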
>>>>>>> By the way, I'm describing a more general procedure here,
>>>>>>> because it's not just this dataset that I want to import; I want
>>>>>>> to import large, heavy amounts of data :)
>>>>>>>
>>>>>>> These are my thoughts for now
>>>>>>>
>>>>>>> [5] https://github.com/martymac/fpart
>>>>>>> [6] https://www.gnu.org/software/glpk/
>>>>>>> [7] https://www.amazon.com/dp/B015VNLGF8
>>>>>>> [8] https://www.qnap.com/en/product/ts-451+
>>>>>>> [9]
>>>>>>> https://www.amazon.com/Intel-NUC-NUC8I7HNK-Gaming-Mini/dp/B07WGWWSWT/
>>>>>>> [10]
>>>>>>> https://www.reddit.com/r/sysadmin/comments/64x2sb/nuc_like_system_but_with_64gb_ram/
>>>>>>> [11]
>>>>>>> https://www.amazon.com/Supermicro-SuperServer-E300-8D-Mini-1U-D-1518/dp/B01M0VTV3E
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 3, 2019 at 1:30 PM Christian Grün <
>>>>>>> christian.gr...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Exactly, it seems to be the final MERGE step during index creation
>>>>>>>> that blows up your system. If you are restricted to 2 GB of
>>>>>>>> main memory, this is what you could try next:
>>>>>>>>
>>>>>>>> 1. Did you already try to tweak the JVM memory limit via -Xmx?
>>>>>>>> What’s
>>>>>>>> the largest value that you can assign on your system?
>>>>>>>>
>>>>>>>> 2. If you will query only specific values of your data sets, you can
>>>>>>>> restrict your indexes to specific elements or attributes; this will
>>>>>>>> reduce memory consumption (see [1] for details and the sketch below).
>>>>>>>> If you observe that no indexes are utilized in your queries anyway,
>>>>>>>> you can simply disable the text and attribute indexes, and memory
>>>>>>>> usage will shrink even more.
>>>>>>>>
>>>>>>>> 3. Create your database on a more powerful system [2] and move it to
>>>>>>>> your target machine (makes only sense if there’s no need for further
>>>>>>>> updates).
>>>>>>>>
>>>>>>>> 4. Distribute your data across multiple databases. In some way, this
>>>>>>>> is comparable to sharding; it cannot be automated, though, as the
>>>>>>>> partitioning strategy depends on the characteristics of your XML
>>>>>>>> input
>>>>>>>> data (some people have huge standalone documents, others have
>>>>>>>> millions
>>>>>>>> of small documents, …).
>>>>>>>>
>>>>>>>> [1] http://docs.basex.org/wiki/Indexes
>>>>>>>> [2] A single CREATE call may be sufficient: CREATE DB database
>>>>>>>> sample-data-for-basex-mailing-list-linuxquestions.org.tar.gz
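>>>>>>>>
>>>>>>>> To illustrate option 2, a sketch (the element and attribute names
>>>>>>>> are placeholders; adapt them to whatever you actually query):
>>>>>>>>
>>>>>>>>   db:optimize('linuxquestions.org', true(), map {
>>>>>>>>     'textindex': true(), 'textinclude': 'title,author',
>>>>>>>>     'attrindex': true(), 'attrinclude': 'id,date',
>>>>>>>>     'ftindex': false()
>>>>>>>>   })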
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Oct 3, 2019 at 8:53 AM first name last name
>>>>>>>> <randomcod...@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > I tried again, using SPLITSIZE = 12 in the .basex config file.
>>>>>>>> > The batch (console) script I used is attached as mass-import.xq.
>>>>>>>> > This time I didn't run the optimize or index creation after the
>>>>>>>> > import; instead, I did it as part of the import, similar to what
>>>>>>>> > is described in [4].
>>>>>>>> > This time I got a different error:
>>>>>>>> > "org.basex.core.BaseXException: Out of Main Memory."
>>>>>>>> > So right now I'm a bit out of ideas. Would AUTOOPTIMIZE make any
>>>>>>>> > difference here?
>>>>>>>> >
>>>>>>>> > Thanks
>>>>>>>> >
>>>>>>>> > [4] http://docs.basex.org/wiki/Indexes#Performance
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > On Wed, Oct 2, 2019 at 11:06 AM first name last name <
>>>>>>>> randomcod...@gmail.com> wrote:
>>>>>>>> >>
>>>>>>>> >> Hey Christian,
>>>>>>>> >>
>>>>>>>> >> Thank you for your answer :)
>>>>>>>> >> I tried setting SPLITSIZE = 24000 in .basex, but I've seen the
>>>>>>>> >> same OOM behavior. It looks like the memory consumption is
>>>>>>>> >> moderate until it reaches about 30GB (the size of the db before
>>>>>>>> >> the optimize), and then memory consumption spikes and OOM occurs.
>>>>>>>> >> Now I'm trying with SPLITSIZE = 1000 and will report back if I
>>>>>>>> >> get OOM again.
>>>>>>>> >> Regarding what you said, it might be that the merge step is where
>>>>>>>> >> the OOM occurs (I wonder if there's any way to control how much
>>>>>>>> >> memory is used inside the merge step).
>>>>>>>> >>
>>>>>>>> >> To quote the statistics page from the wiki:
>>>>>>>> >>     Databases in BaseX are light-weight. If a database limit is
>>>>>>>> reached, you can distribute your documents across multiple database
>>>>>>>> instances and access all of them with a single XQuery expression.
>>>>>>>> >> This sounds like sharding to me. I could probably split the
>>>>>>>> >> documents into chunks and load them into databases that share a
>>>>>>>> >> common prefix but have varying suffixes, which looks a lot like
>>>>>>>> >> shards. By doing this I think I can avoid the OOM, but if BaseX
>>>>>>>> >> provides other, better, perhaps native mechanisms for avoiding
>>>>>>>> >> OOM, I would try them.
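>>>>>>>> >>
>>>>>>>> >> I.e., something along these lines (database names and the query
>>>>>>>> >> itself are made up for illustration):
>>>>>>>> >>
>>>>>>>> >>   (: query all shards sharing a name prefix in one expression :)
>>>>>>>> >>   for $name in db:list()[starts-with(., 'linuxquestions-shard-')]
>>>>>>>> >>   return db:open($name)//*[text() contains text 'kernel panic']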
>>>>>>>> >>
>>>>>>>> >> Best regards,
>>>>>>>> >> Stefan
>>>>>>>> >>
>>>>>>>> >>
>>>>>>>> >> On Tue, Oct 1, 2019 at 4:22 PM Christian Grün <
>>>>>>>> christian.gr...@gmail.com> wrote:
>>>>>>>> >>>
>>>>>>>> >>> Hi first name,
>>>>>>>> >>>
>>>>>>>> >>> If you optimize your database, the indexes will be rebuilt. In
>>>>>>>> >>> this step, the builder tries to guess how much free memory is
>>>>>>>> >>> still available. If memory is exhausted, parts of the index will
>>>>>>>> >>> be split (i.e., partially written to disk) and merged in a final
>>>>>>>> >>> step. However, you can circumvent the heuristics by manually
>>>>>>>> >>> assigning a static split value; see [1] for more information. If
>>>>>>>> >>> you use the DBA, you'll need to set this value in your .basex or
>>>>>>>> >>> web.xml file [2]. In order to find the best value for your
>>>>>>>> >>> setup, it may be easier to play around with the BaseX GUI.
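>>>>>>>> >>>
>>>>>>>> >>> For instance, a line like the following in .basex (the value is
>>>>>>>> >>> only an example; the best number depends on your data and memory):
>>>>>>>> >>>
>>>>>>>> >>>   # number of index build operations before partial index data
>>>>>>>> >>>   # is written to disk
>>>>>>>> >>>   SPLITSIZE = 1000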
>>>>>>>> >>>
>>>>>>>> >>> As you have already seen in our statistics, an XML document has
>>>>>>>> >>> various properties that may represent a limit for a single
>>>>>>>> >>> database. Accordingly, these properties make it difficult for
>>>>>>>> >>> the system to predict when memory will be exhausted during an
>>>>>>>> >>> import or index rebuild.
>>>>>>>> >>>
>>>>>>>> >>> In general, you'll get the best performance (and your memory
>>>>>>>> >>> consumption will be lower) if you create your database and
>>>>>>>> >>> specify the data to be imported in a single run. This is
>>>>>>>> >>> currently not possible via the DBA; use the GUI (Create
>>>>>>>> >>> Database) or console mode (the CREATE DB command) instead.
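>>>>>>>> >>>
>>>>>>>> >>> For example, in console mode (the database name is arbitrary and
>>>>>>>> >>> the path is the one from your earlier mails):
>>>>>>>> >>>
>>>>>>>> >>>   CREATE DB linuxquestions /share/Public/archive/tech-sites/linuxquestions.org/threads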
>>>>>>>> >>>
>>>>>>>> >>> Hope this helps,
>>>>>>>> >>> Christian
>>>>>>>> >>>
>>>>>>>> >>> [1] http://docs.basex.org/wiki/Options#SPLITSIZE
>>>>>>>> >>> [2] http://docs.basex.org/wiki/Configuration
>>>>>>>> >>>
>>>>>>>> >>>
>>>>>>>> >>>
>>>>>>>> >>> On Mon, Sep 30, 2019 at 7:09 AM first name last name
>>>>>>>> >>> <randomcod...@gmail.com> wrote:
>>>>>>>> >>> >
>>>>>>>> >>> > Hi,
>>>>>>>> >>> >
>>>>>>>> >>> > Let's say there's a 30GB dataset [3] containing most
>>>>>>>> threads/posts from [1].
>>>>>>>> >>> > After importing all of it, when I try to run /dba/db-optimize/
>>>>>>>> >>> > on it (which must have some corresponding command), I get the
>>>>>>>> >>> > OOM error in the attached stack trace. I am using -Xmx2g, so
>>>>>>>> >>> > BaseX is limited to 2GB of memory (the machine I'm running this
>>>>>>>> >>> > on doesn't have a lot of memory).
>>>>>>>> >>> > I was looking at [2] for some estimates of peak memory usage
>>>>>>>> >>> > for this "db-optimize" operation, but couldn't find any.
>>>>>>>> >>> > Actually, it would be nice to know the peak memory usage
>>>>>>>> >>> > because, for any database (including BaseX), a common task is
>>>>>>>> >>> > server sizing, i.e. knowing what kind of server would be
>>>>>>>> >>> > needed.
>>>>>>>> >>> > In this case, it seems that 2GB of memory is enough to import
>>>>>>>> >>> > 340k documents weighing in at 30GB in total, but not enough to
>>>>>>>> >>> > run "db-optimize".
>>>>>>>> >>> > Is there any info about peak memory usage in [2]? And are there
>>>>>>>> >>> > guidelines for large-scale collection imports like the one I'm
>>>>>>>> >>> > attempting?
>>>>>>>> >>> >
>>>>>>>> >>> > Thanks,
>>>>>>>> >>> > Stefan
>>>>>>>> >>> >
>>>>>>>> >>> > [1] https://www.linuxquestions.org/
>>>>>>>> >>> > [2] http://docs.basex.org/wiki/Statistics
>>>>>>>> >>> > [3]
>>>>>>>> https://drive.google.com/open?id=1lTEGA4JqlhVf1JsMQbloNGC-tfNkeQt2
>>>>>>>>
>>>>>>>
java.io.FileNotFoundException: /share/CACHEDEV1_DATA/Public/builds/basex/data/linuxquestions.org-selective_1337525745/inf.basex (No such file or directory)
        at java.io.FileOutputStream.open0(Native Method)
        at java.io.FileOutputStream.open(FileOutputStream.java:270)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:162)
        at org.basex.io.IOFile.outputStream(IOFile.java:158)
        at org.basex.io.out.DataOutput.<init>(DataOutput.java:47)
        at org.basex.io.out.DataOutput.<init>(DataOutput.java:36)
        at org.basex.data.DiskData.write(DiskData.java:137)
        at org.basex.data.DiskData.close(DiskData.java:160)
        at org.basex.core.cmd.OptimizeAll.optimizeAll(OptimizeAll.java:145)
        at org.basex.query.up.primitives.db.DBOptimize.apply(DBOptimize.java:124)
        at org.basex.query.up.DataUpdates.apply(DataUpdates.java:175)
        at org.basex.query.up.ContextModifier.apply(ContextModifier.java:120)
        at org.basex.query.up.Updates.apply(Updates.java:178)
        at org.basex.query.QueryContext.update(QueryContext.java:701)
        at org.basex.query.QueryContext.iter(QueryContext.java:332)
        at org.basex.query.QueryProcessor.iter(QueryProcessor.java:90)
        at org.basex.core.cmd.AQuery.query(AQuery.java:107)
        at org.basex.core.cmd.XQuery.run(XQuery.java:22)
        at org.basex.core.Command.run(Command.java:257)
        at org.basex.core.Command.execute(Command.java:93)
        at org.basex.server.ClientListener.run(ClientListener.java:140)
