I was surprised to see the 16 GB RAM machine pop up in your setup. Did you check how many gigabytes of XML data can be full-text indexed with BaseX (and a large -Xmx value, maybe 15g) on that system?

first name last name <randomcod...@gmail.com> wrote on Tue, Oct 8, 2019, 22:40:

> On Mon, Oct 7, 2019 at 1:13 AM Christian Grün <christian.gr...@gmail.com>
> wrote:
>
>> I would recommend you to write SQL commands or an SQL dump to disk (see
>> the BaseX File Module for more information) and run/import this file in a
>> second step; this is probably faster than sending hundreds of thousands of
>> single SQL commands via JDBC, no matter if you are using XQuery or Java.
>
> Ok, so I finally managed to reach a compromise regarding BaseX
> capabilities and the hardware that I have at my disposal (for the time
> being).
> This message will probably answer thread [1] as well (which is separate
> from this one but seems to ask basically the same question, namely how to
> use BaseX as a command-line XQuery processor).
> The attached script takes a large collection of HTML documents, packs
> them into small "balanced" sets, and then runs XQuery on them using BaseX.
> The result is a large number of SQL files ready to be imported into
> PostgreSQL (with some small tweaks, the data could be adapted for import
> into Elasticsearch).
>
> I'm also including some benchmark data:
>
> On system1 the following times were recorded: when run with -j4, it does
> 200 forum thread pages in 10 seconds.
> Apparently there are about 5 posts on average per thread page. So in
> 85000 seconds (almost a day) it would process ~1.7M posts (in ~340k forum
> thread pages) and have them prepared for import into PostgreSQL. With -j4
> the observed peak memory usage was 500MB.
>
> I've tested the script attached on the following two systems:
>
> system1 config:
> - BaseX 9.2.4
> - script (from util-linux 2.31.1)
> - GNU Parallel 20161222
> - Ubuntu 18.04 LTS
>
> system1 hardware:
> - cpu: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz (4 cores)
> - memory: 16GB DDR3 RAM, 2 x Kingston @ 1333 MT/s
> - disk: WDC WD30EURS-73TLHY0 @ 5400-7200RPM
>
> system2 config:
> - BaseX 9.2.4
> - GNU Parallel 20181222
> - script (from util-linux 2.34)
>
> system2 hardware:
> - cpu: Intel(R) Celeron(R) CPU J1900 @ 1.99GHz (4 cores)
> - memory: 4GB RAM DDR @ 1600MHz
> - disk: HDD ST3000VN007-2E4166 @ 5900 rpm
>
> [1]
> https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014722.html
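
For anyone who wants to re-check the system1 throughput figures quoted above, here is a minimal Python sketch of the arithmetic. It assumes the quoted totals (~1.7M posts, ~340k thread pages, 85000 seconds, about 5 posts per page) are the intended values; the constant and function names are illustrative only and do not come from the attached script.

```python
# Sanity check of the benchmark totals quoted in the message above.
# Assumption: the totals are the intended values; names are illustrative.
TOTAL_SECONDS = 85_000    # "almost a day"
TOTAL_POSTS = 1_700_000   # "~1.7M posts"
TOTAL_PAGES = 340_000     # "~340k forum thread pages"
POSTS_PER_PAGE = 5        # "about 5 posts on average per thread page"
BATCH_SECONDS = 10        # length of the measured batch


def per_batch(total: int, total_seconds: int, batch_seconds: int) -> float:
    """Scale a total down to the amount processed in one batch interval."""
    return total / total_seconds * batch_seconds


posts_per_batch = per_batch(TOTAL_POSTS, TOTAL_SECONDS, BATCH_SECONDS)
pages_per_batch = per_batch(TOTAL_PAGES, TOTAL_SECONDS, BATCH_SECONDS)
posts_from_pages = TOTAL_PAGES * POSTS_PER_PAGE

print(f"posts per {BATCH_SECONDS} s: {posts_per_batch:.0f}")
print(f"pages per {BATCH_SECONDS} s: {pages_per_batch:.0f}")
print(f"posts implied by page total: {posts_from_pages}")
```

Under these assumptions the totals work out to about 200 posts (roughly 40 thread pages) per 10 seconds, so the "200 forum thread pages in 10 seconds" figure may actually refer to posts. At 200 pages per 10 seconds, the same 85000 seconds would cover about 1.7M thread pages instead.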