I was surprised to see the 16 GB RAM machine pop up in your setup. Did you
check how many gigabytes of XML data can be full-text indexed with BaseX
(and a large -Xmx value, maybe 15g) on that system?

first name last name <randomcod...@gmail.com> wrote on Tue, 8 Oct 2019,
22:40:

> On Mon, Oct 7, 2019 at 1:13 AM Christian Grün <christian.gr...@gmail.com>
> wrote:
>
>>
>> I would recommend writing SQL commands or an SQL dump to disk (see
>> the BaseX File Module for more information) and running/importing this
>> file in a second step; this is probably faster than sending hundreds of
>> thousands of single SQL commands via JDBC, no matter if you are using
>> XQuery or Java.
>>
>>
> Ok, so I finally managed to reach a compromise regarding BaseX
> capabilities and the hardware that I have at my disposal (for the time
> being).
> This message will probably answer thread [1] as well, which is separate
> from this one but asks essentially the same question: how to use BaseX
> as a command-line XQuery processor.
> The attached script takes a large collection of HTML documents, packs
> them into small "balanced" sets, and then runs XQuery on them using
> BaseX.
> The result is a large number of SQL files ready to be imported into
> PostgreSQL (with some small tweaks, the data could be adapted for import
> into Elasticsearch).
>
> I'm also including some benchmark data:
>
> On system1 the following times were recorded: run with -j4, it processes
> 200 forum thread pages in 10 seconds.
> There are apparently about 5 posts per thread page on average. So in
> 85000 seconds (almost a day) it would process ~1.7M posts (in ~340k forum
> thread pages) and have them prepared for import into PostgreSQL. With -j4
> the observed peak memory usage was 500MB.
>
> I've tested the script attached on the following two systems:
> system1 config:
> - BaseX 9.2.4
> - script (from util-linux 2.31.1)
> - GNU Parallel 20161222
> - Ubuntu 18.04 LTS
>
> system1 hardware:
> - cpu: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz (4 cores)
> - memory: 16GB DDR3 RAM, 2 x Kingston @ 1333 MT/s
> - disk: WDC WD30EURS-73TLHY0 @ 5400-7200RPM
>
> system2 config:
> - BaseX 9.2.4
> - GNU Parallel 20181222
> - script (from util-linux 2.34)
>
> system2 hardware:
> - cpu: Intel(R) Celeron(R) CPU  J1900  @ 1.99GHz  (4 cores)
> - memory: 4GB RAM DDR @ 1600MHz
> - disk: HDD ST3000VN007-2E4166 @ 5900 rpm
>
> [1]
> https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014722.html
>
>
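
The advice quoted above (write the SQL to disk first, import it in a second
step) can be illustrated with a minimal sketch. It is written in Python
rather than XQuery, so it is not the BaseX File Module approach itself, and
the table name, column names and helper functions are hypothetical. It
assumes PostgreSQL's default standard_conforming_strings setting, where
doubling single quotes is sufficient escaping for string literals.

```python
from pathlib import Path


def sql_literal(value):
    """Render a Python value as an SQL literal (hypothetical helper)."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    # Standard SQL escaping: double every single quote inside the string.
    return "'" + str(value).replace("'", "''") + "'"


def render_sql_dump(rows, table="posts", columns=("thread_id", "author", "body")):
    """Build one INSERT per row, wrapped in a single transaction.

    The table and column names are assumptions made for illustration only.
    """
    column_list = ", ".join(columns)
    lines = ["BEGIN;"]
    for row in rows:
        values = ", ".join(sql_literal(v) for v in row)
        lines.append(f"INSERT INTO {table} ({column_list}) VALUES ({values});")
    lines.append("COMMIT;")
    return "\n".join(lines) + "\n"


def write_sql_dump(rows, path, **kwargs):
    """Step 1: write the dump to disk; importing it is a separate step."""
    Path(path).write_text(render_sql_dump(rows, **kwargs), encoding="utf-8")
```

The file could then be imported in a second step, for example with
`psql -d forum -f posts.sql`, where `forum` is a placeholder database name.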

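The attached script is not reproduced in this message, so how it builds its
"balanced" sets is not known. One plausible reading is that the sets are
balanced by total file size. A minimal sketch under that assumption is shown
below: a greedy heuristic that hands the largest remaining file to the
currently lightest set. The function name and its inputs are hypothetical.

```python
import heapq


def pack_balanced(sizes, n_sets):
    """Split files into n_sets groups with roughly equal total size.

    sizes maps a file name to its size in bytes. This is an assumed
    interpretation of "balanced", not the method of the attached script.
    """
    if n_sets < 1:
        raise ValueError("n_sets must be at least 1")
    # Min-heap of (current total size, set index): the lightest set is on top.
    heap = [(0, i) for i in range(n_sets)]
    sets = [[] for _ in range(n_sets)]
    # Largest files first; ties are broken by name to keep the result stable.
    for name, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        total, index = heapq.heappop(heap)
        sets[index].append(name)
        heapq.heappush(heap, (total + size, index))
    return sets
```

In practice the sizes could be collected with `os.path.getsize`, and each
resulting set passed to one BaseX run, with GNU Parallel (-j4 in the
benchmark above) running several sets at once.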