[basex-talk] Migrating ~ 3M record db from BaseX to PostgreSQL results in OOM
Hello,

This is essentially part 2 of trying to index large amounts of web data. To summarize what happened before: the initial discussion started here [1], Christian suggested some options, I dove into each of them, and I realized that doing this on a low-memory system is harder than I initially thought. At Christian's suggestion, I tried to split the big db into smaller dbs and came up with a rudimentary sharding mechanism [3]. All attempts to full-text index 30GB of data in BaseX resulted in OOM for me (do take into consideration that I only have 3.1GB of memory to allocate for BaseX).

Where to? I decided to look more into what Christian said in [2] about option 2: pick the exact values that I want and transfer them to PostgreSQL (after transferring, a GiST index would have to be built there to allow full-text searches; PostgreSQL is picked because it uses an in-memory buffer for all large operations, plus several files on disk, and if it needs to combine results that exceed the available memory, it spills to disk, but at all times it never exceeds the given amount of memory).

Variant 1 (see attached script pg-import.sh)

All good. So, I basically started writing XQuery that would do the following:
- Open up a JDBC connection to PostgreSQL
- Get all text content from each thread page of the forum, and the db it belonged to
- Create a prepared statement for one such thread page, populate it, and execute it

This ended up in OOM after around 250k records. So just to be clear, 250k records did become rows in PostgreSQL, which is nice, but eventually it ended up in OOM. (Perhaps it has to do with how the GC works in Java ..
I don't know.)

Variant 2 (see attached script pg-import2.sh)

I did something similar to the above:
- Open up a JDBC connection to PostgreSQL
- Get all posts, and for each post get the author, the date, the message content, the post id, and the BaseX db name (because we're going over all shards, and each shard is a BaseX db)
- Create a prepared statement for each post with the data mentioned above, and execute it

This also ended up in OOM, after around 340k records (my approximation would be that there were around 3M posts in the data).

To summarize, I'm tempted to believe that there might be a leak in the BaseX implementation of XQuery. The relevant versions of the software used:
- BaseX 9.2.4
- java version "1.8.0_151"
  Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
  Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
- the JVM memory param value was -Xmx3100m

I would be interested to know your thoughts.

Thanks,
Stefan

[1] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-September/014715.html
[2] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014727.html
[3] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014729.html

#!/bin/bash

mkdir tmp

# TODO: work on the full-text part in PostgreSQL;
# a trigger will be required to make it work.
cat << EOF > tmp/archive-schema.sql
CREATE DATABASE "archive";
\c "archive"
CREATE EXTENSION IF NOT EXISTS btree_gist;
DROP TABLE IF EXISTS thread;
CREATE TABLE IF NOT EXISTS thread (
  id SERIAL PRIMARY KEY,
  content TEXT,
  label VARCHAR(300),
  forum VARCHAR(300)
);
-- CREATE INDEX idx_thread_content ON thread USING gist(content);
CREATE INDEX idx_thread_label ON thread(label);
CREATE UNIQUE INDEX idx_thread_uniq_label_forum ON thread(label,forum);
EOF

LD_LIBRARY_PATH="" /share/Public/builds/prefix/bin/psql -U postgres -d postgres < tmp/archive-schema.sql

cat << 'EOF' > tmp/import.xq
let $conn-string := "jdbc:postgresql://localhost:5432/archive?user=postgres&amp;password=postgres"
let $pgconn := sql:connect($conn-string)
let $dbs := fn:filter(db:list(), function($x){ matches($x,"linuxquestions-shard-") })
for $db in fn:reverse(fn:sort($dbs))
for $x in db:open($db)
let $label := $x/fn:base-uri()
let $content := $x//*[matches(@id,"post_message_")]/text()
let $params :=
  <sql:parameters>
    <sql:parameter type="string">{ $label }</sql:parameter>
    <sql:parameter type="string">{ $content }</sql:parameter>
    <sql:parameter type="string">{ $db }</sql:parameter>
  </sql:parameters>
let $prep := sql:prepare($pgconn, "INSERT INTO thread(label,content,forum) VALUES(?,?,?)")
return try {
  sql:execute-prepared($prep, $params)
} catch * {
  'Error [' || $err:code || ']: ' || $err:description || '--' || $params
}
EOF

/share/Public/builds/basex/bin/basexclient -U admin -P admin tmp/import.xq

#!/bin/bash

mkdir tmp

# TODO: work on the full-text part in PostgreSQL;
# a trigger will be required to make it work.
#
# DONE: more detailed content extraction.

cat << EOF > tmp/archive-schema.sql
CREATE DATABASE "archive";
\c "archive"
CREATE EXTENSION IF NOT EXISTS btree_gist;
DROP TABLE IF EXISTS thread2;
CREATE TABLE IF NOT EXISTS thread2 (
  id SERIAL PRIMARY KEY,
  date VARCHAR(300),
  author VARCHAR(300),
  post_id VARCHAR(300),
  doc_uri VARCHAR(300),
  content TEXT,
  basex_db VARCHAR(300)
);
CREATE INDEX idx_thread2_date ON thread2(date);
CREATE
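One detail in import.xq that may be worth experimenting with: sql:prepare is evaluated inside the FLWOR loop, once per record. The sketch below is a guess at a variant that creates the prepared statement once, up front; the file name is hypothetical, everything else mirrors pg-import.sh, and it has not been tested against a live PostgreSQL.

```shell
# Sketch only: hoist sql:prepare out of the per-record loop of import.xq.
# Whether this changes the OOM behavior is an assumption, not a claim.
mkdir -p tmp
cat << 'EOF' > tmp/import-hoisted.xq
let $pgconn := sql:connect("jdbc:postgresql://localhost:5432/archive?user=postgres&amp;password=postgres")
(: one prepared statement, reused for every row :)
let $prep := sql:prepare($pgconn, "INSERT INTO thread(label,content,forum) VALUES(?,?,?)")
for $db in fn:filter(db:list(), function($x){ matches($x,"linuxquestions-shard-") })
for $x in db:open($db)
return sql:execute-prepared($prep,
  <sql:parameters>
    <sql:parameter type="string">{ $x/fn:base-uri() }</sql:parameter>
    <sql:parameter type="string">{ string-join($x//*[matches(@id,"post_message_")]/text(), ' ') }</sql:parameter>
    <sql:parameter type="string">{ $db }</sql:parameter>
  </sql:parameters>)
EOF
```

The query file would then be run the same way as before, e.g. basexclient -U admin -P admin tmp/import-hoisted.xq.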
Re: [basex-talk] Migrating ~ 3M record db from BaseX to PostgreSQL results in OOM
Some complementary notes (others may be able to tell you more about their experiences with large data sets):

> a GiST index would have to be built there, to allow full-text searches;
> PostgreSQL is picked

You could as well have a look at Elasticsearch or its predecessors.

> there might be a leak in the BaseX implementation of XQuery.

I assume you are referring to the SQL Module? Feel free to attach the OOM stack trace, it might give us more insight.

I would recommend you to write SQL commands or an SQL dump to disk (see the BaseX File Module for more information) and run/import this file in a second step; this is probably faster than sending hundreds of thousands of single SQL commands via JDBC, no matter if you are using XQuery or Java.
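To make the two-step idea concrete, here is a minimal shell sketch (the table and columns are taken from pg-import.sh; the emit_insert helper and the sample values are invented for illustration): each record becomes one INSERT line with single quotes doubled, and the resulting file is loaded later with a single psql call.

```shell
# Sketch of the "dump to disk, import in a second step" approach.
# Escaping here only doubles single quotes, the one rule for standard
# SQL string literals; real data may need more care (e.g. NUL bytes).
out=tmp-partial.sql
: > "$out"

emit_insert() {  # emit_insert LABEL CONTENT FORUM  (hypothetical helper)
  l=$(printf '%s' "$1" | sed "s/'/''/g")
  c=$(printf '%s' "$2" | sed "s/'/''/g")
  f=$(printf '%s' "$3" | sed "s/'/''/g")
  printf "INSERT INTO thread(label,content,forum) VALUES('%s','%s','%s');\n" \
    "$l" "$c" "$f" >> "$out"
}

emit_insert "thread-1.html" "it's the post body" "shard-00"
emit_insert "thread-2.html" "another body" "shard-00"

# Second step (not run here):
#   psql -U postgres -d archive -f "$out"
```

Wrapping the generated file in a single transaction (BEGIN/COMMIT around the INSERTs) or using COPY instead of INSERTs should make the import step considerably faster still.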
Re: [basex-talk] Migrating ~ 3M record db from BaseX to PostgreSQL results in OOM
I'm currently at work and my setup is at home. In about 7 hours I'll get home and I will send the stack trace.

Meanwhile, is there any way to write a FLWOR, a loop, in a batched style? For example, in my case, the approach I described to migrate data from BaseX to PostgreSQL uses BaseX as an XQuery processor and transfers the full-text indexing to PostgreSQL; this is what I'm trying to do. However, in order to avoid OOM, I am thinking of batching the transfer into chunks, and potentially restarting the BaseX server between the migration of each chunk. That's why I am asking how I could do that in BaseX. My hope is that the OOM could be avoided this way, because not all the data would pass through main memory at once, and there would be less chance of the JVM GC having to deal with all of it. Restarting the BaseX server between chunk transfers would help make sure that whatever memory was used is released.

So I wonder if a positional range predicate like [$start to $end] would work here. Of course, some count would have to be done beforehand to know how many batches there will be. Or maybe, even without knowing how many batches there will be, a while-type loop could be written in Bash, with the stop condition being to check whether the current chunk is empty.

Would an approach like this work to mitigate the OOM? Are there alternatives or work-arounds for this kind of OOM?

Thanks

On Mon, Oct 7, 2019, 1:13 AM Christian Grün wrote:

> Some complementary notes (others may be able to tell you more about their
> experiences with large data sets):
>
>> a GiST index would have to be built there, to allow full-text searches;
>> PostgreSQL is picked
>
> You could as well have a look at Elasticsearch or its predecessors.
>
>> there might be a leak in the BaseX implementation of XQuery.
>
> I assume you are referring to the SQL Module? Feel free to attach the OOM
> stack trace, it might give us more insight.
> I would recommend you to write SQL commands or an SQL dump to disk (see
> the BaseX File Module for more information) and run/import this file in a
> second step; this is probably faster than sending hundreds of thousands of
> single SQL commands via JDBC, no matter if you are using XQuery or Java.
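The while-type loop described above can be sketched in plain Bash; only the chunk arithmetic is shown here. TOTAL and BATCH are assumed values (TOTAL would come from a count query run beforehand), and the per-chunk BaseX call with its chunk.xq query file is hypothetical and left commented out, so each real run would be a fresh JVM.

```shell
# Sketch: fixed-size batches over a known total, one BaseX run per batch.
TOTAL=3000000   # e.g. count(//post), obtained beforehand
BATCH=100000
start=1
count=0
while [ "$start" -le "$TOTAL" ]; do
  end=$((start + BATCH - 1))
  if [ "$end" -gt "$TOTAL" ]; then end=$TOTAL; fi
  # Hypothetical per-chunk invocation (fresh JVM each time):
  #   basex -b start=$start -b end=$end chunk.xq
  # where chunk.xq would select //post[position() = $start to $end]
  if [ $count -eq 0 ]; then first="$start-$end"; fi
  last="$start-$end"
  count=$((count + 1))
  start=$((end + 1))
done
echo "$count batches, first $first, last $last"
```

With these assumed numbers the loop produces 30 batches, from positions 1-100000 up to 2900001-3000000; restarting the server between iterations then guarantees the memory of one batch cannot leak into the next.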
Re: [basex-talk] Migrating ~ 3M record db from BaseX to PostgreSQL results in OOM
On Mon, Oct 7, 2019 at 1:13 AM Christian Grün wrote:

> I would recommend you to write SQL commands or an SQL dump to disk (see
> the BaseX File Module for more information) and run/import this file in a
> second step; this is probably faster than sending hundreds of thousands of
> single SQL commands via JDBC, no matter if you are using XQuery or Java.

Ok, so I finally managed to reach a compromise regarding BaseX capabilities and the hardware that I have at my disposal (for the time being). This message will probably answer thread [1] as well (which is separate from this one but seems to ask basically the same question, namely how to use BaseX as a command-line XQuery processor).

The script attached will take a large collection of HTML documents, pack them into small "balanced" sets, and then run XQuery on them using BaseX. The result will be a lot of SQL files ready to be imported into PostgreSQL (with some small tweaks, the data could be adapted to be imported into Elasticsearch).

I'm also including some benchmark data. On system1 the following times were recorded: if run with -j4, it does 200 forum thread pages in 10 seconds. And apparently there are about 5 posts on average per thread page. So in 85000 seconds (almost a day) it would process ~1.7M posts (in ~340k forum thread pages) and have them prepared to be imported into PostgreSQL. With -j4 the observed peak memory usage was 500MB.
I've tested the script attached on the following two systems:

system1 config:
- BaseX 9.2.4
- script (from util-linux 2.31.1)
- GNU Parallel 20161222
- Ubuntu 18.04 LTS

system1 hardware:
- cpu: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz (4 cores)
- memory: 16GB DDR3 RAM, 2 x Kingston @ 1333 MT/s
- disk: WDC WD30EURS-73TLHY0 @ 5400-7200RPM

system2 config:
- BaseX 9.2.4
- GNU Parallel 20181222
- script (from util-linux 2.34)

system2 hardware:
- cpu: Intel(R) Celeron(R) CPU J1900 @ 1.99GHz (4 cores)
- memory: 4GB RAM DDR @ 1600MHz
- disk: HDD ST3000VN007-2E4166 @ 5900 rpm

[1] https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014722.html

#!/bin/bash
#
# This script leverages BaseX as an XQuery command line processor
# by using multiple small disposable BaseX databases, and parallelizing
# the entire processing. It will essentially run XQuery in batches on
# large data sets, and produce SQL insert statements, so the data can
# be imported into PostgreSQL.
#
# We're packing files for processing, and we're trying to balance them
# out in sets such that two constraints are met:
# - no more than 100 files per set
# - no more than 100*90k bytes per set
#
# Timing:
#
# On system1 the following times were recorded:
# If run with -j4 it does 200 thread pages in 10 seconds.
# And apparently there are about 5 posts on average per thread page.
# So in 85000 seconds (which is almost a day) it would process ~1.7M
# posts (in ~340k forum thread pages) and have them prepared to be
# imported into PostgreSQL.
# Again, for -j4, the observed peak memory usage was 500MB.
#
# Notes:
#
# 1)
# The following error (found through strace) would manifest itself,
# mainly because of GNU Parallel:
# --- stopped by SIGTTOU ---
# It's also described here:
# https://notmuchmail.org/pipermail/notmuch/2019/028015.html
# It can be circumvented through the use of script
# (script - make typescript of terminal session)
#
# 2) --linebuffer is used for GNU Parallel so it can write to stdout as
# soon as possible.
#
# system1 config:
# - BaseX 9.2.4
# - script (from util-linux 2.31.1)
# - GNU Parallel 20161222
# - Ubuntu 18.04 LTS
#
# system1 hardware:
# - cpu: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
# - memory: 16GB DDR3 RAM, 2 x Kingston @ 1333 MT/s
# - disk: WDC WD30EURS-73TLHY0 @ 5400-7200RPM
#
# system2 config:
# - BaseX 9.2.4
# - GNU Parallel 20181222
# - script (from util-linux 2.34)
#
# system2 hardware:
# - cpu: Intel(R) Celeron(R) CPU J1900 @ 1.99GHz (4 cores)
# - memory: 4GB RAM DDR @ 1600MHz
# - disk: HDD ST3000VN007-2E4166 @ 5900 rpm

BASEX="$HOME/basex-preloaded-forums/bin/basex"

mkdir meta-process
echo "Partitioning files into different sets ..."
#fpart -f 100 -s $((100 * 9)) `pwd`/threads/ -o meta-process/files-shard

proc() {
  s="$1"
  f="$2"
  j="$3"
  echo "$s -- $j -- $f"

  # Build script to create temp db, and import all the html files
  SHARD_IMPORT_SCRIPT=$(pwd)"/tmp/import.script.$s"
  SHARD_PROCESS_SCRIPT=$(pwd)"/tmp/process.script.$s"
  SHARD_SQL_OUT=$(pwd)"/tmp/partial.$j.sql"

  cat << EOF > $SHARD_IMPORT_SCRIPT
DROP DB tmp-$s
CREATE DB tmp-$s
SET PARSER html
SET HTMLPARSER method=xml,nons=true,nocdata=true,nodefaults=true,nobogons=true,nocolons=true,ignorable=true
SET CREATEFILTER *.html
EOF
  cat $f | perl -pne 's{^}{ADD }g;' >> $SHARD_IMPORT_SCRIPT

  script --return -c "$BASEX < $SHARD_IMPORT_SCRIPT >/dev/null ; echo 'Importing Done'; "

  # Build processing script, to pull values and build SQL queries
  echo "for \$x in db:open(\"tmp-$s\
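The set-balancing rule from the script's header comment (at most 100 files and at most 100*90k bytes per set, handled there by fpart) can be sketched as a single greedy pass. The file names, the uniform 90000-byte sizes, and the meta-sets.txt output file below are simulated test data, not part of the real pipeline.

```shell
# Sketch of the packing constraints without fpart: walk the file list once
# and close the current set when either limit would be exceeded.
MAX_FILES=100
MAX_BYTES=$((100 * 90000))

set_no=0
files=0
bytes=0

pack() {  # pack NAME SIZE -> prints one "files-shard-N NAME" line
  if [ $files -ge $MAX_FILES ] || [ $((bytes + $2)) -gt $MAX_BYTES ]; then
    set_no=$((set_no + 1))
    files=0
    bytes=0
  fi
  files=$((files + 1))
  bytes=$((bytes + $2))
  echo "files-shard-$set_no $1"
}

# Simulated input: 250 pages of 90000 bytes each -> sets of 100, 100, 50.
i=0
while [ $i -lt 250 ]; do
  pack "thread-$i.html" 90000
  i=$((i + 1))
done > meta-sets.txt
```

In the real script each "files-shard-N" list would then be fed to proc() through GNU Parallel, one disposable BaseX database per set.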
Re: [basex-talk] Migrating ~ 3M record db from BaseX to PostgreSQL results in OOM
I was surprised to see the 16 GB RAM machine pop up in your setup. Did you check how many gigabytes of XML data can be fulltext-indexed with BaseX (and a large -Xmx value, maybe 15g) on that system?

first name last name wrote on Tue, Oct 8, 2019, 22:40:

> On Mon, Oct 7, 2019 at 1:13 AM Christian Grün wrote:
>
>> I would recommend you to write SQL commands or an SQL dump to disk (see
>> the BaseX File Module for more information) and run/import this file in a
>> second step; this is probably faster than sending hundreds of thousands of
>> single SQL commands via JDBC, no matter if you are using XQuery or Java.
>
> Ok, so I finally managed to reach a compromise regarding BaseX
> capabilities and the hardware that I have at my disposal (for the time
> being).
> This message will probably answer thread [1] as well (which is separate
> from this but seems to ask the same question basically, which is, how to
> use BaseX as a command-line XQuery processor).
> The script attached will take a large collection of HTML documents, it
> will pack them into small "balanced" sets, and then it will run XQuery on
> them using BaseX.
> The result will be a lot of SQL files ready to be imported in PostgreSQL
> (with some small tweaks, the data could be adapted to be imported in
> Elasticsearch).
>
> I'm also including some benchmark data:
>
> On system1 the following times were recorded: If run with -j4 it does 200
> forum thread pages in 10 seconds.
> And apparently there's about 5 posts on average per thread page. So in
> 85000 seconds (almost a day) it would process ~1.7M posts (in ~340k forum
> thread pages) and have them prepared to be imported in PostgreSQL. With -j4
> the observed peak memory usage was 500MB.
> I've tested the script attached on the following two systems:
>
> system1 config:
> - BaseX 9.2.4
> - script (from util-linux 2.31.1)
> - GNU Parallel 20161222
> - Ubuntu 18.04 LTS
>
> system1 hardware:
> - cpu: Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz (4 cores)
> - memory: 16GB DDR3 RAM, 2 x Kingston @ 1333 MT/s
> - disk: WDC WD30EURS-73TLHY0 @ 5400-7200RPM
>
> system2 config:
> - BaseX 9.2.4
> - GNU Parallel 20181222
> - script (from util-linux 2.34)
>
> system2 hardware:
> - cpu: Intel(R) Celeron(R) CPU J1900 @ 1.99GHz (4 cores)
> - memory: 4GB RAM DDR @ 1600MHz
> - disk: HDD ST3000VN007-2E4166 @ 5900 rpm
>
> [1]
> https://mailman.uni-konstanz.de/pipermail/basex-talk/2019-October/014722.html